3D Human Mesh Estimation from Virtual Markers

Xiaoxuan Ma1  Jiajun Su1  Chunyu Wang 3  Wentao Zhu1  Yizhou Wang1, 2, 4
School of Computer Science, Center on Frontiers of Computing Studies, Peking University
Inst. for Artificial Intelligence, Peking University
Microsoft Research Asia
Nat’l Eng. Research Center of Visual Technology
{maxiaoxuan, sujiajun, wtzhu, yizhou.wang}@pku.edu.cn, [email protected]
Corresponding author
Abstract

Inspired by the success of volumetric 3D pose estimation, some recent human mesh estimators propose to estimate 3D skeletons as intermediate representations, from which the dense 3D meshes are regressed by exploiting the mesh topology. However, body shape information is lost in extracting skeletons, leading to mediocre performance. Advanced motion capture systems solve the problem by placing dense physical markers on the body surface, which allows realistic meshes to be extracted from their non-rigid motions. However, they cannot be applied to wild images without markers. In this work, we present an intermediate representation, named virtual markers, which learns 64 landmark keypoints on the body surface from large-scale mocap data in a generative style, mimicking the effects of physical markers. The virtual markers can be accurately detected from wild images and can reconstruct the intact meshes with realistic shapes by simple interpolation. Our approach outperforms the state-of-the-art methods on three datasets. In particular, it surpasses the existing methods by a notable margin on the SURREAL dataset, which has diverse body shapes. Code is available at https://github.com/ShirleyMaxx/VirtualMarker.

1 Introduction

Figure 1: Mesh estimation results on four examples with different body shapes. Pose2Mesh [7], which uses 3D skeletons as the intermediate representation, fails to predict accurate shapes. Our virtual marker-based method obtains accurate estimates.

3D human mesh estimation aims to estimate the 3D positions of the mesh vertices on the body surface. The task has attracted a lot of attention from the computer vision and computer graphics communities [3, 43, 30, 35, 51, 19, 25, 37, 27, 10] because it can benefit many applications such as virtual reality [15]. Recently, deep learning-based methods [19, 7, 29] have significantly advanced the accuracy on benchmark datasets.

Pioneering methods [51, 19] propose to regress the pose and shape parameters of mesh models such as SMPL [36] directly from images. While straightforward, their accuracy is usually lower than the state of the art. The first reason is that the mapping from the image features to the model parameters is highly non-linear and suffers from image-model misalignment [29]. Besides, existing mesh datasets [16, 54, 38, 28] are small and limited to simple laboratory environments due to the complex capturing process. The lack of sufficient training data severely limits their performance.

Recently, some works [26, 39] have begun to formulate mesh estimation as a dense 3D keypoint detection task, inspired by the success of volumetric pose estimation [47, 50, 65, 44, 59, 45]. For example, in [26, 39], the authors propose to regress the 3D positions of all vertices. However, this is computationally expensive because a mesh has several thousand vertices. Moon and Lee [39] improve the efficiency by decomposing the 3D heatmaps into multiple 1D heatmaps at the cost of mediocre accuracy. Choi et al. [7] propose to first detect a sparser set of skeleton joints in the images, from which the dense 3D meshes are regressed by exploiting the mesh topology. The methods along this direction have attracted increasing attention [7, 29, 55] for two reasons. First, the proxy task of 3D skeleton estimation can leverage the abundant 2D pose datasets, which notably improves the accuracy. Second, mesh regression from the skeletons is efficient. However, important information about the body shapes is lost in extracting the 3D skeletons, which has been largely overlooked. As a result, different types of body shapes, such as lean or obese, cannot be accurately estimated (see Figure 1).

The professional marker-based motion capture (mocap) method MoSh [35] places physical markers on the body surface and explores their subtle non-rigid motions to extract meshes with accurate shapes. However, the physical markers restrict the approach to laboratory environments. This inspires us to ask whether we can identify a set of landmarks on the mesh as virtual markers, e.g., elbow and wrist, that can be detected from wild images and allow accurate body shapes to be recovered. The desired virtual markers should satisfy several requirements. First, the number of markers should be much smaller than that of the mesh vertices so that we can use volumetric representations to efficiently estimate their 3D positions. Second, the markers should capture the mesh topology so that the intact mesh can be accurately regressed from them. Third, the virtual markers should have distinguishable visual patterns so that they can be detected from images.

In this work, we present a learning algorithm based on archetypal analysis [12] to identify a subset of mesh vertices as the virtual markers that try to satisfy the above requirements to the best extent. Figure 2 shows that the learned virtual markers coarsely outline the body shape and pose which paves the way for estimating meshes with accurate shapes. Then we present a simple framework for 3D mesh estimation on top of the representation as shown in Figure 3. It first learns a 3D keypoint estimation network based on [47] to detect the 3D positions of the virtual markers. Then we recover the intact mesh simply by interpolating them. The interpolation weights are pre-trained in the representation learning step and will be adjusted by a light network based on the prediction confidences of the virtual markers for each image.

We extensively evaluate our approach on three benchmark datasets. It consistently outperforms the state-of-the-art methods on all of them. In particular, it achieves a significant gain on the SURREAL dataset [53] which has a variety of body shapes. Our ablation study also validates the advantages of the virtual marker representation in terms of recovering accurate shapes. Finally, the method shows decent generalization ability and generates visually appealing results for the wild images.

2 Related work

2.1 Optimization-based mesh estimation

Before deep learning dominated this field, 3D human mesh estimation [35, 2, 28, 41, 60] was mainly optimization-based: the parameters of a human mesh model are optimized to match the observations. For example, Loper et al. [35] propose MoSh, which optimizes the SMPL parameters to align the mesh with the 3D marker positions. It is usually used to get GT 3D meshes for benchmark datasets because of its high accuracy. Later works propose to optimize the model parameters or mesh vertices based on 2D image cues [2, 28, 41, 60, 11]. They extract intermediate representations such as 2D skeletons from the images and optimize the mesh model by minimizing the discrepancy between the model projection and these intermediate representations. These methods are usually sensitive to initialization and suffer from local optima.

Figure 2: Left: The learned virtual markers (blue balls) in the back and front views. Grey balls denote markers that are invisible in the front view. The virtual markers act similarly to physical body markers and approximately outline the body shape. Right: Mesh estimation results by our approach. From left to right: the input image, the estimated 3D mesh overlaid on the image, and three different viewpoints showing the estimated 3D mesh with our intermediate predicted virtual markers (blue balls).

2.2 Learning-based mesh estimation

Recently, most works follow the learning-based framework and have achieved promising results. Deep networks [51, 19, 25, 37, 27] are used to regress the SMPL parameters from image features. However, the mapping from the image space to the parameter space is highly non-linear [39] and thus hard to learn. In addition, these methods suffer from the misalignment between the meshes and image pixels [62]. These problems make it difficult to learn an accurate yet generalizable model.

Some works propose to introduce proxy tasks to get intermediate representations first, hoping to alleviate the learning difficulty. In particular, intermediate representations of physical markers [61], IUV images [57, 62, 64, 63], body part segmentation masks [52, 24, 28, 40] and body skeletons [49, 7, 29, 55] have been proposed. For example, THUNDR [61] first estimates the 3D locations of physical markers from images and then reconstructs the mesh from the 3D markers. The physical markers can be interpreted as a simplified representation of body shape and pose. Although it is very accurate, it cannot be applied to wild images without markers. In contrast, the body skeleton is a popular human representation that can be robustly detected from wild images. Choi et al. [7] propose to first estimate the 3D skeletons, and then estimate the intact mesh from them. However, accurate body shapes are difficult to recover from the oversimplified 3D skeletons.

Our work belongs to the learning-based class and is related to works that use physical markers or skeletons as intermediate representations. Different from them, we propose a novel intermediate representation, named virtual markers, which is more expressive than body skeletons, reducing the ambiguity in pose and shape estimation, and which can be applied to wild images.

3 Method

In this section, we describe the details of our approach. First, Section 3.1 introduces how we learn the virtual marker representation from mocap data. Then we present the overall framework for mesh estimation from an image in Section 3.2. Finally, Section 3.3 discusses the loss functions and training details.

3.1 The virtual marker representation

We represent a mesh by a vector of vertex positions $\mathbf{x}\in\mathbb{R}^{3M}$, where $M$ is the number of mesh vertices. Denote a mocap dataset such as [16] with $N$ meshes as $\overset{\frown}{\mathbf{X}}=[\mathbf{x}_{1},\,...,\,\mathbf{x}_{N}]\in\mathbb{R}^{3M\times N}$. To unveil the latent structure among vertices, we reshape it to $\mathbf{X}\in\mathbb{R}^{3N\times M}$ with each column $\mathbf{x}_{i}\in\mathbb{R}^{3N}$ representing all possible positions of the $i^{\text{th}}$ vertex in the dataset [16].
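The reshaping step can be illustrated with a minimal NumPy sketch. It assumes that each mesh vector stores its coordinates vertex by vertex as $(x_1, y_1, z_1, x_2, \dots)$ and that the rows of the reshaped matrix are grouped per mesh; neither layout is stated in the text, and the function name `to_vertex_columns` is hypothetical.

```python
import numpy as np

def to_vertex_columns(X_frown: np.ndarray, M: int) -> np.ndarray:
    """Reshape a (3M, N) matrix whose columns are meshes into a (3N, M)
    matrix whose i-th column stacks the 3D position of vertex i over all
    N meshes. Assumes a vertex-major layout (x1, y1, z1, x2, ...)."""
    three_m, N = X_frown.shape
    if three_m != 3 * M:
        raise ValueError("expected 3*M rows")
    # (3M, N) -> (M, 3, N): vertex index, coordinate, mesh index
    per_vertex = X_frown.reshape(M, 3, N)
    # -> (N, 3, M) -> (3N, M): rows 3n..3n+2 hold (x, y, z) of mesh n
    return per_vertex.transpose(2, 1, 0).reshape(3 * N, M)

# Toy example: N = 2 meshes with M = 4 vertices each.
toy_meshes = np.arange(24, dtype=float).reshape(2, 4, 3)   # (N, M, 3)
toy_X_frown = toy_meshes.reshape(2, 12).T                  # (3M, N)
toy_X = to_vertex_columns(toy_X_frown, M=4)                # (3N, M)
```

With this layout, column $i$ of the result collects every observed position of vertex $i$, which is the view the subsequent analysis operates on.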

The rank of $\mathbf{X}$ is smaller than $M$ because the mesh representation is smooth and redundant where some vertices can be accurately reconstructed by the others. While it seems natural to apply PCA [18] to $\mathbf{X}$ to compute the eigenvectors as virtual markers for reconstructing others, there is no guarantee that the virtual markers correspond to the mesh vertices, making them difficult to detect from images. Instead, we aim to learn $K$ virtual markers $\mathbf{Z}=[\mathbf{z}_{1},...,\mathbf{z}_{K}]\in\mathbb{R}^{3N\times K}$ that try to satisfy the following two requirements to the greatest extent. First, they can accurately reconstruct the intact mesh $\mathbf{X}$ by their linear combinations: $\mathbf{X}=\mathbf{Z}\mathbf{A}$, where $\mathbf{A}\in\mathbb{R}^{K\times M}$ is a coefficient matrix that encodes the spatial relationship between the virtual markers and the mesh vertices. Second, they should have distinguishable visual patterns in images so that they can be easily detected from images. Ideally, they can be on the body surface as the meshes.

We apply archetypal analysis [12, 4] to learn $\mathbf{Z}$ by minimizing a reconstruction error with two additional constraints: (1) each vertex $\mathbf{x}_{i}$ can be reconstructed by convex combinations of $\mathbf{Z}$, and (2) each marker $\mathbf{z}_{i}$ should be convex combinations of the mesh vertices $\mathbf{X}$:

$$\min_{\substack{\bm{\alpha}_{i}\in\Delta_{K}\ \text{for}\ 1\leq i\leq M,\\ \bm{\beta}_{j}\in\Delta_{M}\ \text{for}\ 1\leq j\leq K}}\|\mathbf{X}-\mathbf{X}\mathbf{B}\mathbf{A}\|^{2}_{F}, \tag{1}$$

where $\mathbf{A}=[\bm{\alpha}_{1},...,\bm{\alpha}_{M}]\in\mathbb{R}^{K\times M}$, each $\bm{\alpha}$ resides in the simplex $\Delta_{K}\triangleq\{\bm{\alpha}\in\mathbb{R}^{K}\ \mathrm{s.t.}\ \bm{\alpha}\succeq 0\ \text{and}\ \|\bm{\alpha}\|_{1}=1\}$, and $\mathbf{B}=[\bm{\beta}_{1},...,\bm{\beta}_{K}]\in\mathbb{R}^{M\times K}$, $\bm{\beta}_{j}\in\Delta_{M}$. We adopt the Active-set algorithm [4] to solve objective (1) and obtain the learned virtual markers $\mathbf{Z}=\mathbf{X}\mathbf{B}\in\mathbb{R}^{3N\times K}$. As shown in [12, 4], the two constraints encourage the virtual markers $\mathbf{Z}$ to unveil the latent structure among vertices; therefore they learn to be close to the extreme points of the mesh and located on the body surface as much as possible.
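To make objective (1) and its two simplex constraints concrete, the sketch below minimizes $\|\mathbf{X}-\mathbf{X}\mathbf{B}\mathbf{A}\|_F^2$ with alternating projected gradient steps. This is a swapped-in solver chosen for brevity, not the Active-set algorithm of [4]; the function names, the random initialization, and the step-size rule are assumptions made for illustration only.

```python
import numpy as np

def project_columns_to_simplex(V):
    """Euclidean projection of every column of V onto the probability simplex."""
    n, m = V.shape
    U = -np.sort(-V, axis=0)                     # columns sorted in descending order
    css = np.cumsum(U, axis=0) - 1.0
    idx = np.arange(1, n + 1)[:, None]
    rho = (U - css / idx > 0).sum(axis=0)        # support size per column (always >= 1)
    theta = css[rho - 1, np.arange(m)] / rho     # per-column shift
    return np.maximum(V - theta, 0.0)

def archetypal_analysis(X, K, iters=200, seed=0):
    """Approximately minimize ||X - X B A||_F^2 with columns of A in Delta_K
    and columns of B in Delta_M. Returns A (K, M), B (M, K) and the objective
    value at initialization and after every iteration."""
    rng = np.random.default_rng(seed)
    M = X.shape[1]
    A = project_columns_to_simplex(rng.random((K, M)))
    B = project_columns_to_simplex(rng.random((M, K)))
    x_norm2 = np.linalg.norm(X, 2) ** 2          # squared spectral norm of X
    losses = [np.linalg.norm(X - X @ B @ A) ** 2]
    for _ in range(iters):
        # Update A with B fixed; 1/L step with L = 2 * ||Z||_2^2.
        Z = X @ B
        R = X - Z @ A
        step = 1.0 / (2.0 * np.linalg.norm(Z, 2) ** 2 + 1e-12)
        A = project_columns_to_simplex(A + 2.0 * step * (Z.T @ R))
        # Update B with A fixed; 1/L step with L = 2 * ||X||_2^2 * ||A||_2^2.
        R = X - Z @ A
        step = 1.0 / (2.0 * x_norm2 * np.linalg.norm(A, 2) ** 2 + 1e-12)
        B = project_columns_to_simplex(B + 2.0 * step * (X.T @ R @ A.T))
        losses.append(np.linalg.norm(X - X @ B @ A) ** 2)
    return A, B, losses
```

Given the returned `B`, the learned markers are `Z = X @ B`; each column of `Z` is a convex combination of vertex columns, and each vertex column is approximated by a convex combination of the columns of `Z`.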

Figure 3: Overview of our framework. Given an input image $\mathbf{I}$, it first estimates the 3D positions $\hat{\mathbf{P}}$ of the virtual markers. Then we update the coefficient matrix $\hat{\mathbf{A}}$ based on the estimation confidence scores $\mathbf{C}$ of the virtual markers. Finally, the complete human mesh can be simply recovered by linear multiplication $\hat{\mathbf{M}}=\hat{\mathbf{P}}\hat{\mathbf{A}}$.

Type | Formula | Reconst. Error (mm) ↓
Original | $\|\mathbf{X}-\mathbf{X}\mathbf{B}\mathbf{A}\|^{2}_{F}$ | 11.67
Symmetric | $\|\mathbf{X}-\mathbf{X}\widetilde{\mathbf{B}}^{sym}\widetilde{\mathbf{A}}^{sym}\|^{2}_{F}$ | 10.98

Table 1: The reconstruction errors using the original and the symmetric sets of markers on the H3.6M dataset [16], respectively. The errors are small, indicating that the markers are sufficiently expressive and can reconstruct all vertices accurately.

Post-processing. Since the human body is left-right symmetric, we adjust $\mathbf{Z}$ to reflect the property. We first replace each $\mathbf{z}_{i}\in\mathbf{Z}$ by its nearest vertex on the mesh and obtain $\widetilde{\mathbf{Z}}\in\mathbb{R}^{3\times K}$. This step allows us to compute the left or right counterpart of each marker. Then we replace the markers in the right body with the symmetric vertices in the left body and obtain the symmetric markers $\widetilde{\mathbf{Z}}^{sym}\in\mathbb{R}^{3\times K}$. Finally we update $\mathbf{B}$ and $\mathbf{A}$ by minimizing $\|\mathbf{X}-\mathbf{X}\widetilde{\mathbf{B}}^{sym}\widetilde{\mathbf{A}}^{sym}\|^{2}_{F}$ subject to $\widetilde{\mathbf{Z}}^{sym}=\mathbf{X}\widetilde{\mathbf{B}}^{sym}$. More details are elaborated in the supplementary.
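The first post-processing step, snapping each learned marker to its nearest mesh vertex, might be sketched as follows. The sketch compares full columns of $\mathbf{Z}$ and $\mathbf{X}$ rather than positions on a single mesh, which is an assumption, and it omits the symmetrization step because that needs a left-right vertex correspondence not given in this section. Both function names are hypothetical.

```python
import numpy as np

def snap_markers_to_vertices(X, Z):
    """For each marker column of Z (3N, K), return the index of the closest
    vertex column of X (3N, M) in Euclidean distance."""
    # Squared distances between every vertex column and every marker column: (M, K)
    d2 = (X ** 2).sum(axis=0)[:, None] - 2.0 * (X.T @ Z) + (Z ** 2).sum(axis=0)[None, :]
    return d2.argmin(axis=0)

def one_hot_B(vertex_ids, M):
    """Build a selection matrix B (M, K) such that X @ B picks the chosen vertices."""
    K = len(vertex_ids)
    B = np.zeros((M, K))
    B[vertex_ids, np.arange(K)] = 1.0
    return B
```

After snapping, every column of the selection matrix is a vertex of the simplex $\Delta_M$, so the constraint that markers are convex combinations of mesh vertices still holds, and the markers coincide with actual mesh vertices.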

Figure 2 shows the virtual markers learned on the mocap dataset [16] after post-processing. They are similar to the physical markers and approximately outline the body shape, which agrees with our expectations. They are roughly evenly distributed on the surface of the body, and some of them are located close to the body keypoints, which have distinguishable visual patterns to be accurately detected. Table 1 shows the reconstruction errors of using the original markers $\mathbf{X}\mathbf{B}$ and the symmetric markers $\mathbf{X}\widetilde{\mathbf{B}}^{sym}$. Both can reconstruct meshes accurately.

3.2 Mesh estimation framework

On top of the virtual markers, we present a simple yet effective framework for end-to-end 3D human mesh estimation from a single image. As shown in Figure 3, it consists of two branches. The first branch uses a volumetric CNN [47] to estimate the 3D positions $\hat{\mathbf{P}}$ of the markers, and the second branch reconstructs the full mesh $\hat{\mathbf{M}}$ by predicting a coefficient matrix $\hat{\mathbf{A}}$:

$$\hat{\mathbf{M}}=\hat{\mathbf{P}}\hat{\mathbf{A}}. \tag{2}$$

We will describe the two branches in more detail.

3D marker estimation. We train a neural network to estimate a 3D heatmap $\hat{\mathbf{H}}=[\hat{\mathbf{H}}_{1},\,...,\,\hat{\mathbf{H}}_{K}]\in\mathbb{R}^{K\times D\times H\times W}$ from an image. The heatmap encodes per-voxel likelihood of each marker. There are $D\times H\times W$ voxels in total which are used to discretize the 3D space. The 3D position $\hat{\mathbf{P}}_{z}\in\mathbb{R}^{3}$ of each marker is computed as the center of mass of the corresponding heatmap $\hat{\mathbf{H}}_{z}$ [47] as follows:

$$\hat{\mathbf{P}}_{z}=\sum_{d=1}^{D}\sum_{h=1}^{H}\sum_{w=1}^{W}(d,h,w)\cdot\hat{\mathbf{H}}_{z}(d,h,w). \tag{3}$$

The positions of all markers are represented as $\hat{\mathbf{P}}=[\hat{\mathbf{P}}_{1},\hat{\mathbf{P}}_{2},\cdots,\hat{\mathbf{P}}_{K}]$.
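The center-of-mass computation in Eq. (3) can be sketched in NumPy as below. The sketch assumes that the raw network output is normalized with a softmax over all voxels so that each heatmap sums to one, and it uses 0-based voxel indices instead of the 1-based indices of Eq. (3); the function name `soft_argmax_3d` is hypothetical.

```python
import numpy as np

def soft_argmax_3d(logits):
    """logits: (K, D, H, W) raw scores, one volume per marker.
    Returns the expected (d, h, w) voxel coordinates of shape (K, 3)
    and the softmax-normalized heatmaps of shape (K, D, H, W)."""
    K, D, H, W = logits.shape
    flat = logits.reshape(K, -1)
    flat = flat - flat.max(axis=1, keepdims=True)     # numerical stability
    prob = np.exp(flat)
    prob /= prob.sum(axis=1, keepdims=True)           # each heatmap sums to 1
    heat = prob.reshape(K, D, H, W)
    # Marginalize the heatmap onto each axis, then take the expected index.
    d = (heat.sum(axis=(2, 3)) * np.arange(D)).sum(axis=1)
    h = (heat.sum(axis=(1, 3)) * np.arange(H)).sum(axis=1)
    w = (heat.sum(axis=(1, 2)) * np.arange(W)).sum(axis=1)
    return np.stack([d, h, w], axis=1), heat
```

Because the estimate is an expectation rather than a hard argmax, it is differentiable with respect to the heatmap and can take sub-voxel values.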

Interpolation. Ideally, if we have accurate estimates for all virtual markers $\hat{\mathbf{P}}$, then we can recover the complete mesh with sufficient accuracy by simply multiplying $\hat{\mathbf{P}}$ with a fixed coefficient matrix $\widetilde{\mathbf{A}}^{sym}$, as validated in Table 1. However, in practice, some markers may have large estimation errors because they may be occluded in the monocular setting. Note that this happens frequently. For example, the markers on the back will be occluded when a person is facing the camera. As a result, inaccurate marker positions may bring large errors to the final mesh if we directly multiply them with the fixed matrix $\widetilde{\mathbf{A}}^{sym}$.

Our solution is to rely more on those accurately detected markers. To that end, we propose to update the coefficient matrix based on the estimation confidence scores of the markers. In practice, we simply take the heatmap score at the estimated position of each marker, i.e. $\hat{\mathbf{H}}_{z}(\hat{\mathbf{P}}_{z})$, and feed the scores to a single fully-connected layer to obtain the coefficient matrix $\hat{\mathbf{A}}$. Then the mesh is reconstructed by $\hat{\mathbf{M}}=\hat{\mathbf{P}}\hat{\mathbf{A}}$.
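A minimal sketch of the confidence-driven interpolation is given below. Several details are assumptions rather than facts from the text: the confidence is read at the voxel nearest to the estimated position, the fully-connected layer maps the $K$ confidences to a flattened $K\times M$ matrix, and its bias is initialized from the fixed matrix $\widetilde{\mathbf{A}}^{sym}$ so that the layer starts from the pre-trained interpolation weights. The function names are hypothetical, and the conversion from voxel indices to metric coordinates is omitted.

```python
import numpy as np

def sample_confidence(heat, P_voxel):
    """heat: (K, D, H, W) normalized heatmaps; P_voxel: (K, 3) estimated
    (d, h, w) voxel coordinates. Returns the heatmap value at the voxel
    nearest to each estimate, shape (K,)."""
    K = heat.shape[0]
    upper = np.array(heat.shape[1:]) - 1
    idx = np.clip(np.rint(P_voxel).astype(int), 0, upper)
    return heat[np.arange(K), idx[:, 0], idx[:, 1], idx[:, 2]]

def reconstruct_mesh(P_xyz, conf, W, b):
    """P_xyz: (3, K) marker coordinates; conf: (K,) confidence scores;
    W: (K*M, K) and b: (K*M,) parameters of one fully-connected layer.
    Returns the mesh M_hat = P_hat A_hat of shape (3, M)."""
    K = conf.shape[0]
    A_hat = (W @ conf + b).reshape(K, -1)     # image-specific coefficients (K, M)
    return P_xyz @ A_hat

# With W = 0 and b = A_sym.ravel(), the layer reproduces the fixed interpolation.
K, M = 3, 5
A_sym = np.full((K, M), 1.0 / K)              # toy stand-in for the learned matrix
mesh = reconstruct_mesh(np.eye(3), np.ones(K), np.zeros((K * M, K)), A_sym.ravel())
```

Starting from the fixed coefficients means the layer only has to learn how to shift weight away from low-confidence markers, which matches the motivation stated above.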

3.3 Training

We train the whole network end-to-end in a supervised way. The overall loss function is defined as:

$$\mathcal{L}=\lambda_{vm}\mathcal{L}_{vm}+\lambda_{c}\mathcal{L}_{conf}+\lambda_{m}\mathcal{L}_{mesh}. \tag{4}$$

Virtual marker loss. We define $\mathcal{L}_{vm}$ as the $L_{1}$ distance between the predicted 3D virtual markers $\hat{\mathbf{P}}$ and the GT $\hat{\mathbf{P}}^{*}$ as follows:

$$\mathcal{L}_{vm}=\|\hat{\mathbf{P}}-\hat{\mathbf{P}}^{*}\|_{1}. \tag{5}$$

Note that it is easy to get GT markers $\hat{\mathbf{P}}^{*}$ from GT meshes as stated in Section 3.1 without additional manual annotations.

Confidence loss. We also require that the 3D heatmaps have reasonable shapes; therefore, the heatmap score at the voxel containing the GT marker position $\hat{\mathbf{P}}_{z}^{*}$ should have the maximum value, as in the previous work [17]:

$$\mathcal{L}_{conf}=-\sum_{z=1}^{K}\log(\hat{\mathbf{H}}_{z}(\hat{\mathbf{P}}_{z}^{*})). \tag{6}$$
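The virtual marker loss (5), the confidence loss (6), and the weighted sum (4) can be written down directly, as in the following NumPy sketch. It assumes that the GT marker positions have already been converted to integer voxel indices for the confidence term and adds a small `eps` inside the logarithm for numerical safety; the loss weights are left as required arguments because their values are not given in this section. All function names are hypothetical.

```python
import numpy as np

def virtual_marker_loss(P_hat, P_gt):
    """Eq. (5): L1 distance between predicted and GT marker positions."""
    return np.abs(P_hat - P_gt).sum()

def confidence_loss(heat, gt_voxels, eps=1e-12):
    """Eq. (6): negative log heatmap score at the GT voxel of each marker.
    heat: (K, D, H, W) normalized heatmaps; gt_voxels: (K, 3) integer indices."""
    K = heat.shape[0]
    scores = heat[np.arange(K), gt_voxels[:, 0], gt_voxels[:, 1], gt_voxels[:, 2]]
    return -np.log(scores + eps).sum()

def total_loss(l_vm, l_conf, l_mesh, lambda_vm, lambda_c, lambda_m):
    """Eq. (4): weighted sum of the three loss terms."""
    return lambda_vm * l_vm + lambda_c * l_conf + lambda_m * l_mesh
```

The confidence term reaches its minimum when all heatmap mass of a marker sits in the GT voxel, which also drives the center of mass in Eq. (3) toward the GT position.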

Mesh loss. Following [39], we define meshsubscript𝑚𝑒𝑠\mathcal{L}_{mesh}caligraphic_L start_POSTSUBSCRIPT italic_m italic_e italic_s italic_h end_POSTSUBSCRIPT as a weighted sum of four losses:

mesh=vertex+pose+normal+λeedge.subscript𝑚𝑒𝑠subscript𝑣𝑒𝑟𝑡𝑒𝑥subscript𝑝𝑜𝑠𝑒subscript𝑛𝑜𝑟𝑚𝑎𝑙subscript𝜆𝑒subscript𝑒𝑑𝑔𝑒\displaystyle\small\mathcal{L}_{mesh}=\mathcal{L}_{vertex}+\mathcal{L}_{pose}+% \mathcal{L}_{normal}+\lambda_{e}\mathcal{L}_{edge}.caligraphic_L start_POSTSUBSCRIPT italic_m italic_e italic_s italic_h end_POSTSUBSCRIPT = caligraphic_L start_POSTSUBSCRIPT italic_v italic_e italic_r italic_t italic_e italic_x end_POSTSUBSCRIPT + caligraphic_L start_POSTSUBSCRIPT italic_p italic_o italic_s italic_e end_POSTSUBSCRIPT + caligraphic_L start_POSTSUBSCRIPT italic_n italic_o italic_r italic_m italic_a italic_l end_POSTSUBSCRIPT + italic_λ start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT italic_e italic_d italic_g italic_e end_POSTSUBSCRIPT . (7)
  • Vertex coordinate loss. We adopt L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT loss between predicted 3D mesh coordinates 𝐌^^𝐌\hat{\mathbf{M}}over^ start_ARG bold_M end_ARG with GT mesh 𝐌^superscript^𝐌\hat{\mathbf{M}}^{*}over^ start_ARG bold_M end_ARG start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT as:

    vertex=𝐌^𝐌^1.subscript𝑣𝑒𝑟𝑡𝑒𝑥subscriptnorm^𝐌superscript^𝐌1\displaystyle\mathcal{L}_{vertex}=\|\hat{\mathbf{M}}-\hat{\mathbf{M}}^{*}\|_{1}.caligraphic_L start_POSTSUBSCRIPT italic_v italic_e italic_r italic_t italic_e italic_x end_POSTSUBSCRIPT = ∥ over^ start_ARG bold_M end_ARG - over^ start_ARG bold_M end_ARG start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT . (8)
  • Pose loss. We use L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT loss between the 3D landmark joints regressed from mesh 𝐌^𝒥^𝐌𝒥\hat{\mathbf{M}}\mathcal{J}over^ start_ARG bold_M end_ARG caligraphic_J and the GT joints 𝐉^superscript^𝐉\hat{\mathbf{J}}^{*}over^ start_ARG bold_J end_ARG start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT as:

    pose=𝐌^𝒥𝐉^1,subscript𝑝𝑜𝑠𝑒subscriptnorm^𝐌𝒥superscript^𝐉1\displaystyle\mathcal{L}_{pose}=\|\hat{\mathbf{M}}\mathcal{J}-\hat{\mathbf{J}}% ^{*}\|_{1},caligraphic_L start_POSTSUBSCRIPT italic_p italic_o italic_s italic_e end_POSTSUBSCRIPT = ∥ over^ start_ARG bold_M end_ARG caligraphic_J - over^ start_ARG bold_J end_ARG start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , (9)

    where 𝒥M×J𝒥superscript𝑀𝐽\mathcal{J}\in\mathbb{R}^{M\times J}caligraphic_J ∈ blackboard_R start_POSTSUPERSCRIPT italic_M × italic_J end_POSTSUPERSCRIPT is a pre-defined joint regression matrix in SMPL model [2].

  • Surface losses. To improve surface smoothness [56], we supervise the normal vectors of the triangle faces with the GT normal vectors by $\mathcal{L}_{normal}$ and the edge lengths of the predicted mesh with the GT lengths by $\mathcal{L}_{edge}$:

    $$\mathcal{L}_{normal} = \sum_{f} \sum_{\{i,j\} \subset f} \left| \left\langle \frac{\hat{\mathbf{M}}_{i} - \hat{\mathbf{M}}_{j}}{\|\hat{\mathbf{M}}_{i} - \hat{\mathbf{M}}_{j}\|_{2}}, \hat{\mathbf{n}}_{f}^{*} \right\rangle \right|, \tag{10}$$
    $$\mathcal{L}_{edge} = \sum_{f} \sum_{\{i,j\} \subset f} \left| \|\hat{\mathbf{M}}_{i} - \hat{\mathbf{M}}_{j}\|_{2} - \|\hat{\mathbf{M}}_{i}^{*} - \hat{\mathbf{M}}_{j}^{*}\|_{2} \right|,$$

    where $f$ and $\hat{\mathbf{n}}_{f}^{*}$ denote a triangle face in the mesh and its GT unit normal vector, respectively, $\hat{\mathbf{M}}_{i}$ denotes the $i^{th}$ vertex of $\hat{\mathbf{M}}$, and the superscript $*$ denotes GT.
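The pose loss in Eq. (9) and the two surface losses can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the row-major (M, 3) vertex layout (the transpose of the 3 × M convention used in the text), and the small `eps` guard are choices made only for this sketch.

```python
import numpy as np

def pose_loss(verts, joint_regressor, joints_gt):
    """L1 distance between joints regressed from the mesh and GT joints (Eq. 9).

    verts: (M, 3) predicted vertices (assumed layout),
    joint_regressor: (M, J), joints_gt: (J, 3).
    """
    joints_pred = joint_regressor.T @ verts  # (J, 3)
    return np.abs(joints_pred - joints_gt).sum()

def _face_edges(faces):
    """Vertex-index pairs {i, j} of every triangle, shape (F, 3, 2)."""
    return np.stack([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]], axis=1)

def face_normals(verts, faces):
    """Unit normal of every (non-degenerate) triangle face, shape (F, 3)."""
    a, b, c = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
    n = np.cross(b - a, c - a)
    return n / np.linalg.norm(n, axis=1, keepdims=True)

def normal_loss(verts, verts_gt, faces, eps=1e-8):
    """Penalize predicted unit edges that are not orthogonal to the GT face normal."""
    edges = _face_edges(faces)                           # (F, 3, 2)
    e = verts[edges[..., 0]] - verts[edges[..., 1]]      # (F, 3, 3) edge vectors
    e = e / (np.linalg.norm(e, axis=-1, keepdims=True) + eps)
    n_gt = face_normals(verts_gt, faces)                 # (F, 3)
    return np.abs(np.einsum("fek,fk->fe", e, n_gt)).sum()

def edge_loss(verts, verts_gt, faces):
    """Absolute difference between predicted and GT edge lengths, summed over faces."""
    edges = _face_edges(faces)
    len_pred = np.linalg.norm(verts[edges[..., 0]] - verts[edges[..., 1]], axis=-1)
    len_gt = np.linalg.norm(verts_gt[edges[..., 0]] - verts_gt[edges[..., 1]], axis=-1)
    return np.abs(len_pred - len_gt).sum()
```

In a training setup these terms would be written in an autodiff framework and combined with weights that are not specified in this section.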

4 Experiments

| Method | Intermediate representation | H3.6M MPVE↓ | H3.6M MPJPE↓ | H3.6M PA-MPJPE↓ | 3DPW MPVE↓ | 3DPW MPJPE↓ | 3DPW PA-MPJPE↓ |
|---|---|---|---|---|---|---|---|
| Arnab et al. [1] CVPR’19 | 2D skeleton | - | 77.8 | 54.3 | - | - | 72.2 |
| HMMR [20] CVPR’19 | - | - | - | 56.9 | 139.3 | 116.5 | 72.6 |
| DSD-SATN [49] ICCV’19 | 3D skeleton | - | 59.1 | 42.4 | - | - | 69.5 |
| VIBE [23] CVPR’20 | - | - | 65.9 | 41.5 | 99.1 | 82.9 | 51.9 |
| TCMR [6] CVPR’21 | - | - | 62.3 | 41.1 | 102.9 | 86.5 | 52.7 |
| MAED [55] ICCV’21 | 3D skeleton | - | 56.3 | 38.7 | 92.6 | 79.1 | 45.7 |
| SMPLify [2] ECCV’16 | 2D skeleton | - | - | 82.3 | - | - | - |
| HMR [19] CVPR’18 | - | 96.1 | 88.0 | 56.8 | 152.7 | 130.0 | 81.3 |
| GraphCMR [26] CVPR’19 | 3D vertices | - | - | 50.1 | - | - | 70.2 |
| SPIN [25] ICCV’19 | - | - | - | 41.1 | 116.4 | 96.9 | 59.2 |
| DenseRac [57] ICCV’19 | IUV image | - | 76.8 | 48.0 | - | - | - |
| DecoMR [62] CVPR’20 | IUV image | - | 60.6 | 39.3 | - | - | - |
| ExPose [9] ECCV’20 | - | - | - | - | - | 93.4 | 60.7 |
| Pose2Mesh [7] ECCV’20 | 3D skeleton | 85.3 | 64.9 | 46.3 | 106.3 | 88.9 | 58.3 |
| I2L-MeshNet [39] ECCV’20 | 3D vertices | 65.1 | 55.7 | 41.1 | 110.1 | 93.2 | 57.7 |
| PC-HMR [37] AAAI’21 | 3D skeleton | - | - | - | 108.6 | 87.8 | 66.9 |
| HybrIK [29] CVPR’21 | 3D skeleton | 65.7 | 54.4 | 34.5 | 86.5 | 74.1 | 45.0 |
| METRO [32] CVPR’21 | 3D vertices | - | 54.0 | 36.7 | 88.2 | 77.1 | 47.9 |
| ROMP [48] ICCV’21 | - | - | - | - | 108.3 | 91.3 | 54.9 |
| Mesh Graphormer [33] ICCV’21 | 3D vertices | - | 51.2 | 34.5 | 87.7 | 74.7 | 45.6 |
| PARE [24] ICCV’21 | Segmentation | - | - | - | 88.6 | 74.5 | 46.5 |
| THUNDR [61] ICCV’21 | 3D markers | - | 55.0 | 39.8 | 88.0 | 74.8 | 51.5 |
| PyMAF [64] ICCV’21 | IUV image | - | 57.7 | 40.5 | 110.1 | 92.8 | 58.9 |
| ProHMR [27] ICCV’21 | - | - | - | 41.2 | - | - | 59.8 |
| OCHMR [21] CVPR’22 | 2D heatmap | - | - | - | 107.1 | 89.7 | 58.3 |
| 3DCrowdNet [8] CVPR’22 | 3D skeleton | - | - | - | 98.3 | 81.7 | 51.5 |
| CLIFF [31] ECCV’22 | - | - | 47.1 | 32.7 | 81.2 | 69.0 | 43.0 |
| FastMETRO [5] ECCV’22 | 3D vertices | - | 52.2 | 33.7 | 84.1 | 73.5 | 44.6 |
| VisDB [58] ECCV’22 | 3D vertices | - | 51.0 | 34.5 | 85.5 | 73.5 | 44.9 |
| Ours | Virtual marker | 58.0 | 47.3 | 32.0 | 77.9 | 67.5 | 41.3 |
Table 2: Comparison to the state-of-the-arts on H3.6M [16] and 3DPW [54] datasets. means using temporal cues. The methods are not strictly comparable because they may have different backbones and training datasets. We provide the numbers only to show proof-of-concept results.

4.1 Datasets and metrics

H3.6M [16]. We use (S1, S5, S6, S7, S8) for training and (S9, S11) for testing. As in [19, 7, 32, 33], we report MPJPE and PA-MPJPE for poses that are derived from the estimated meshes. We also report Mean Per Vertex Error (MPVE) for the whole mesh.

3DPW [54] is collected in natural scenes. Following the previous works [32, 33, 24, 61], we use the train set of 3DPW to learn the model and evaluate on the test set. The same evaluation metrics as H3.6M are used.

SURREAL [53] is a large-scale synthetic dataset with GT SMPL annotations and has diverse samples in terms of body shapes, backgrounds, etc. We use its training set to train a model and evaluate the test split following [7].

4.2 Implementation Details

We learn 64 virtual markers on the H3.6M [16] training set. We use the same set of markers for all datasets instead of learning a separate set on each dataset. Following [19, 7, 39, 61, 26, 23, 33, 32], we conduct mix-training by using the MPI-INF-3DHP [38], UP-3D [28], and COCO [34] training sets for experiments on the H3.6M and 3DPW datasets. We adapt a 3D pose estimator [47] with HRNet-W48 [46] as the image feature backbone for estimating the 3D virtual markers. We set the number of voxels in each dimension to 64, i.e. $D = H = W = 64$ for the 3D heatmaps. Following [19, 26, 39], we crop every single human region from the input image and resize it to $256 \times 256$. We use the Adam [22] optimizer to train the whole framework for 40 epochs with a batch size of 32. The learning rates for the two branches are set to $5 \times 10^{-4}$ and $1 \times 10^{-3}$, respectively, and are halved after the $30^{th}$ epoch. Please refer to the supplementary for more details.
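For readers who want to reproduce the setup, the hyperparameters above can be collected in a small configuration object. The sketch below is only an illustration: the class name, the field names, and the reading of "after the 30th epoch" as "from the 31st epoch on, counting epochs from 1" are assumptions, and the order of the two learning rates simply follows the order in which the text lists them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyperparameters listed in Section 4.2."""
    num_markers: int = 64            # K virtual markers
    heatmap_size: int = 64           # D = H = W
    input_size: int = 256            # cropped person region, 256 x 256
    epochs: int = 40
    batch_size: int = 32
    base_lrs: tuple = (5e-4, 1e-3)   # one learning rate per branch
    decay_epoch: int = 30            # rates are halved after this epoch
    decay_factor: float = 0.5

def learning_rates(cfg, epoch):
    """Learning rates for a 1-indexed epoch (assumed step schedule)."""
    scale = cfg.decay_factor if epoch > cfg.decay_epoch else 1.0
    return tuple(lr * scale for lr in cfg.base_lrs)
```

With these defaults, `learning_rates(TrainConfig(), 30)` keeps the base rates and `learning_rates(TrainConfig(), 31)` returns both rates halved.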

| Method | Intermediate representation | MPVE↓ | MPJPE↓ | PA-MPJPE↓ |
|---|---|---|---|---|
| HMR [19] CVPR’18 | - | 85.1 | 73.6 | 55.4 |
| BodyNet [52] ECCV’18 | Skel. + Seg. | 65.8 | - | - |
| GraphCMR [26] CVPR’19 | 3D vertices | 103.2 | 87.4 | 63.2 |
| SPIN [25] ICCV’19 | - | 82.3 | 66.7 | 43.7 |
| DecoMR [62] CVPR’20 | IUV image | 68.9 | 52.0 | 43.0 |
| Pose2Mesh [7] ECCV’20 | 3D skeleton | 68.8 | 56.6 | 39.6 |
| PC-HMR [37] AAAI’21 | 3D skeleton | 59.8 | 51.7 | 37.9 |
| DynaBOA [14] TPAMI’22 | - | 70.7 | 55.2 | 34.0 |
| Ours | Virtual marker | 44.7 | 36.9 | 28.9 |
Table 3: Comparison to the state-of-the-arts on SURREAL [53] dataset. means training on the test split with 2D supervisions. “Skel. + Seg.” means using skeleton and segmentation together.

4.3 Comparison to the State-of-the-arts

Results on H3.6M. Table 2 compares our approach to the state-of-the-art methods on the H3.6M dataset. Our method achieves competitive or superior performance. In particular, it outperforms the methods that use skeletons (Pose2Mesh [7], DSD-SATN [49]), body markers (THUNDR [61]), or IUV images [62, 64] as proxy representations, demonstrating the effectiveness of the virtual marker representation.

Results on 3DPW. We compare our method to the state-of-the-art methods on the 3DPW dataset in Table 2. Our approach achieves state-of-the-art results among all the methods, validating the advantages of the virtual marker representation over the skeleton representation used in Pose2Mesh [7], DSD-SATN [49], and other representations like IUV image used in PyMAF [64]. In particular, our approach outperforms I2L-MeshNet [39], METRO [32], and Mesh Graphormer [33] by a notable margin, which suggests that virtual markers are more suitable and effective representations than detecting all vertices directly as most of them are not discriminative enough to be accurately detected.

Results on SURREAL. This dataset has more diverse samples in terms of body shapes. The results are shown in Table 3. Our approach outperforms the state-of-the-art methods by a notable margin, especially in terms of MPVE. Figure 1 shows some challenging cases without cherry-picking. The skeleton representation loses the body shape information so the method [7] can only recover mean shapes. In contrast, our approach generates much more accurate mesh estimation results.

| No. | Intermediate representation | MPVE↓ (H3.6M) | MPVE↓ (SURREAL) |
|---|---|---|---|
| (a) | Skeleton | 64.4 | 53.6 |
| (b) | Rand virtual marker | 63.0 | 50.1 |
| (c) | Virtual marker | 58.0 | 44.7 |
Table 4: Ablation study of the virtual marker representation for our approach on H3.6M and SURREAL datasets. “Skeleton” means the sparse landmark joint representation is used. “Rand virtual marker” means the virtual markers are randomly selected from all the vertices without learning. (c) is our method, where the learned virtual markers are used.

4.4 Ablation study

Virtual marker representation. We compare our method to two baselines in Table 4. First, in baseline (a), we replace the virtual markers of our method with the skeleton representation. The rest is kept the same as ours (c). Our method achieves a much lower MPVE than baseline (a), demonstrating that the virtual markers help to estimate body shapes more accurately than the skeletons. In baseline (b), we randomly sample 64 of the 6890 mesh vertices as virtual markers. We repeat the experiment five times and report the average number. The result is worse than ours because the randomly selected vertices may not be expressive enough to reconstruct the other vertices, or cannot be accurately detected from images as they lack distinguishable visual patterns. The results validate the effectiveness of our learning strategy.
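The random-marker baseline (b) can be sketched as below. This is an assumption-laden illustration rather than the authors' script: `evaluate` is a hypothetical callback that would train and test a model for a given marker set and return its MPVE, and the seeds 0 to 4 are arbitrary.

```python
import numpy as np

NUM_VERTICES = 6890  # vertices of the body mesh
NUM_MARKERS = 64     # number of virtual markers

def sample_random_markers(seed):
    """Pick NUM_MARKERS distinct vertex indices uniformly at random."""
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(NUM_VERTICES, size=NUM_MARKERS, replace=False))

def average_over_trials(evaluate, num_trials=5):
    """Average a score over several random marker sets.

    `evaluate` is a hypothetical function mapping an index array to a score.
    """
    scores = [evaluate(sample_random_markers(seed)) for seed in range(num_trials)]
    return float(np.mean(scores))
```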

Figure 4: Mesh estimation results of different methods on H3.6M test set. Our method with virtual marker representation gets better shape estimation results than Pose2Mesh which uses skeleton representation. Note the waistline of the body and the thickness of the arm.

Figure 1 shows some qualitative results on the SURREAL test set. The meshes estimated by the baseline which uses skeleton representation, i.e. Pose2Mesh [7], have inaccurate body shapes. This is reasonable because the skeleton is oversimplified and has very limited capability to recover shapes. Instead, it implicitly learns a mean shape for the whole training dataset. In contrast, the mesh estimated by using virtual markers has much better quality due to its strong representation power and therefore can handle different body shapes elegantly. Figure 4 also shows some qualitative results on the H3.6M test set. For clarity, we draw the intermediate representation (blue balls) in it as well.

Figure 5: Visualization of the learned virtual markers for different numbers of markers, $K = 16, 32, 96$, from left to right.

Number of virtual markers. We evaluate how the number of virtual markers affects estimation quality on the H3.6M [16] dataset. Figure 5 visualizes the learned virtual markers, which are all located on the body surface and close to the extreme points of the mesh. This is expected as mentioned in Section 3.1. Table 5 (GT) shows the mesh reconstruction results when we have GT 3D positions of the virtual markers in objective (1). When we increase the number of virtual markers, both the mesh reconstruction error (MPVE) and the regressed landmark joint error (MPJPE) steadily decrease. This is expected because using more virtual markers improves the representation power. However, using more virtual markers cannot guarantee smaller estimation errors when we need to estimate the virtual marker positions from images as in our method. This is because the additional virtual markers may have large estimation errors which affect the mesh estimation result. The results are shown in Table 5 (Det). Increasing the number of virtual markers $K$ steadily reduces the MPVE when $K$ is smaller than 96, from 58.7 at $K = 16$ to 58.0 at $K = 64$. However, if we keep increasing $K$, the error begins to increase, reaching 59.6 at $K = 96$. This is mainly because some of the newly introduced virtual markers are difficult to detect from images and therefore bring errors to mesh estimation.

| $K$ | GT MPVE↓ | GT MPJPE↓ | Det MPVE↓ | Det MPJPE↓ |
|---|---|---|---|---|
| 16 | 46.8 | 39.8 | 58.7 | 47.8 |
| 32 | 20.1 | 14.2 | 58.2 | 48.3 |
| 64 | 11.0 | 7.5 | 58.0 | 47.3 |
| 96 | 9.9 | 5.6 | 59.6 | 48.2 |
Table 5: Ablation study of the different number of virtual markers ($K$) on H3.6M [16] dataset. (GT) Mesh reconstruction results when GT 3D positions of the virtual markers are used in objective (1). (Det) Mesh estimation results obtained by our proposed framework when we use different numbers of virtual markers ($K$).
Figure 6: Mesh estimation comparison results when using (a) the fixed coefficient matrix $\widetilde{\mathbf{A}}^{sym}$, and (b) the updated $\hat{\mathbf{A}}$. Please zoom in to better see the details.
Figure 7: Top: Meshes estimated by our approach on images from 3DPW test set. The rightmost case in the dashed box shows a typical failure. Bottom: Meshes estimated by our approach on Internet images with challenging cases (extreme shapes or in a long dress).

Coefficient matrix. We compare our method to a baseline which uses the fixed coefficient matrix $\widetilde{\mathbf{A}}^{sym}$. We show the qualitative comparison in Figure 6. The mesh estimated with the fixed coefficient matrix (a) has mostly correct pose and shape but also shows some artifacts, while using the updated coefficient matrix (b) gives better mesh estimation results. As shown in Table 6, using a fixed coefficient matrix gets larger MPVE and MPJPE errors than using the updated coefficient matrix. This is caused by the estimation errors of virtual markers when occlusion happens, which is inevitable since the virtual markers on the back will be self-occluded by the front body. As a result, inaccurate marker positions would bring large errors to the final mesh estimates if we directly use the fixed matrix.

| No. | Method | Fixed $\widetilde{\mathbf{A}}^{sym}$ | Updated $\hat{\mathbf{A}}$ | MPVE↓ | MPJPE↓ |
|---|---|---|---|---|---|
| (a) | Ours (fixed) | ✓ | | 64.7 | 51.6 |
| (b) | Ours | | ✓ | 58.0 | 47.3 |
Table 6: Ablation study of the coefficient matrix for our approach on H3.6M dataset. “fixed” means using the fixed coefficient matrix $\widetilde{\mathbf{A}}^{sym}$ to reconstruct the mesh.

4.5 Qualitative Results

Figure 7 (top) presents some meshes estimated by our approach on natural images from the 3DPW test set. The rightmost case shows a typical failure where our method has a wrong pose estimate of the left leg due to heavy occlusion. We can see that the failure is constrained to the local region and the rest of the body still gets accurate estimates. We further analyze how inaccurate virtual markers would affect the mesh estimation, i.e. when part of the human body is occluded or truncated. Based on the finally learned coefficient matrix $\hat{\mathbf{A}}$ of our model, we highlight the relationship weights between the virtual markers and all vertices in Figure 8. We can see that our model learns a local and sparse dependency between each vertex and the virtual markers, e.g. for each vertex, the virtual markers that contribute the most are nearby, as shown in Figure 8 (b). Therefore, in inference, if a virtual marker has an inaccurate position estimate due to occlusion or truncation, the dependent vertices may have inaccurate estimates, while the rest will be barely affected. Figure 2 (right) shows more examples where occlusion or truncation occurs, and our method can still get accurate or reasonable estimates robustly. Note that when truncation occurs, our method still guesses the positions of the truncated virtual markers.

Figure 8: (a) For each virtual marker (represented by a star), we highlight the top 30 most affected vertices (represented by a colored dot) based on the average coefficient matrix $\hat{\mathbf{A}}$. (b) For each vertex (dot), we highlight the top 3 virtual markers (star) that contribute the most. We can see that the dependency has a strong locality which improves the robustness when some virtual markers cannot be accurately detected.
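The two rankings visualized in Figure 8 can be computed from a coefficient matrix with a few lines of NumPy. The sketch below rests on assumptions that are not confirmed by this section: the matrix is taken to have shape (K, M), with one row per virtual marker and one column per vertex, and "most affected" or "contributes the most" is interpreted as the largest absolute coefficient. The function names are illustrative.

```python
import numpy as np

def top_vertices_per_marker(coeff, k=30):
    """For each marker (row), indices of the k vertices with the largest |weight|.

    coeff: (K, M) coefficient matrix (assumed layout). Returns (K, k) indices.
    """
    return np.argsort(-np.abs(coeff), axis=1)[:, :k]

def top_markers_per_vertex(coeff, k=3):
    """For each vertex (column), indices of the k markers with the largest |weight|.

    Returns an (M, k) index array.
    """
    return np.argsort(-np.abs(coeff), axis=0)[:k].T
```

If the rankings are local, as the figure suggests, the vertices returned for a marker should lie close to that marker on the mesh surface.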

Figure 7 (bottom) shows our estimated meshes on challenging cases, which indicates the strong generalization ability of our model on diverse postures and actions in natural scenes. Please refer to the supplementary for more qualitative results. Note that since the datasets do not provide supervision of head orientation, facial expression, hands, or feet, the estimates of these parts inevitably stay in canonical poses. Apart from that, most errors are due to inaccurate 3D virtual marker estimation, which may be addressed using more powerful estimators or more diverse training datasets in the future.

5 Conclusion

In this paper, we present a novel intermediate representation Virtual Marker, which is more expressive than the prevailing skeleton representation and more accessible than physical markers. It can reconstruct 3D meshes more accurately and efficiently, especially in handling diverse body shapes. Besides, the coefficient matrix in the virtual marker representation encodes spatial relationships among mesh vertices which allows the method to implicitly explore structure priors of human body. It achieves better mesh estimation results than the state-of-the-art methods and shows advanced generalization potential in spite of its simplicity.

Acknowledgement

This work was supported by MOST-2022ZD0114900 and NSFC-62061136001.

References

  • [1] Anurag Arnab, Carl Doersch, and Andrew Zisserman. Exploiting temporal context for 3d human pose estimation in the wild. In CVPR, pages 3395–3404, 2019.
  • [2] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In ECCV, pages 561–578, 2016.
  • [3] Ronan Boulic, Pascal Bécheiraz, Luc Emering, and Daniel Thalmann. Integration of motion control techniques for virtual human and avatar real-time animation. In Proceedings of the ACM symposium on Virtual reality software and technology, pages 111–118, 1997.
  • [4] Yuansi Chen, Julien Mairal, and Zaid Harchaoui. Fast and robust archetypal analysis for representation learning. In CVPR, pages 1478–1485, 2014.
  • [5] Junhyeong Cho, Kim Youwang, and Tae-Hyun Oh. Cross-attention of disentangled modalities for 3d human mesh recovery with transformers. In ECCV, 2022.
  • [6] Hongsuk Choi, Gyeongsik Moon, Ju Yong Chang, and Kyoung Mu Lee. Beyond static features for temporally consistent 3d human pose and shape from a video. In CVPR, pages 1964–1973, 2021.
  • [7] Hongsuk Choi, Gyeongsik Moon, and Kyoung Mu Lee. Pose2mesh: Graph convolutional network for 3d human pose and mesh recovery from a 2d human pose. In ECCV, pages 769–787, 2020.
  • [8] Hongsuk Choi, Gyeongsik Moon, JoonKyu Park, and Kyoung Mu Lee. Learning to estimate robust 3d human mesh from in-the-wild crowded scenes. In CVPR, pages 1475–1484, June 2022.
  • [9] Vasileios Choutas, Georgios Pavlakos, Timo Bolkart, Dimitrios Tzionas, and Michael J Black. Monocular expressive body regression through body-driven attention. In ECCV, pages 20–40, 2020.
  • [10] Hai Ci, Mingdong Wu, Wentao Zhu, Xiaoxuan Ma, Hao Dong, Fangwei Zhong, and Yizhou Wang. Gfpose: Learning 3d human pose prior with gradient fields. arXiv preprint arXiv:2212.08641, 2022.
  • [11] Enric Corona, Gerard Pons-Moll, Guillem Alenyà, and Francesc Moreno-Noguer. Learned vertex descent: a new direction for 3d human model fitting. In ECCV, pages 146–165. Springer, 2022.
  • [12] Adele Cutler and Leo Breiman. Archetypal analysis. Technometrics, 36(4):338–347, 1994.
  • [13] John C Gower. Generalized procrustes analysis. Psychometrika, 40(1):33–51, 1975.
  • [14] Shanyan Guan, Jingwei Xu, Michelle Z He, Yunbo Wang, Bingbing Ni, and Xiaokang Yang. Out-of-domain human mesh reconstruction via dynamic bilevel online adaptation. IEEE TPAMI, 2022.
  • [15] Yinghao Huang, Federica Bogo, Christoph Lassner, Angjoo Kanazawa, Peter V Gehler, Javier Romero, Ijaz Akhter, and Michael J Black. Towards accurate marker-less human shape and pose estimation over time. In 3DV, pages 421–430, 2017.
  • [16] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE TPAMI, 36(7):1325–1339, 2013.
  • [17] Karim Iskakov, Egor Burkov, Victor Lempitsky, and Yury Malkov. Learnable triangulation of human pose. In ICCV, pages 7718–7727, 2019.
  • [18] Ian T Jolliffe. Principal components in regression analysis. In Principal component analysis, pages 129–155. Springer, 1986.
  • [19] Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In CVPR, pages 7122–7131, 2018.
  • [20] Angjoo Kanazawa, Jason Y Zhang, Panna Felsen, and Jitendra Malik. Learning 3d human dynamics from video. In CVPR, pages 5614–5623, 2019.
  • [21] Rawal Khirodkar, Shashank Tripathi, and Kris Kitani. Occluded human mesh recovery. In CVPR, pages 1715–1725, June 2022.
  • [22] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [23] Muhammed Kocabas, Nikos Athanasiou, and Michael J Black. Vibe: Video inference for human body pose and shape estimation. In CVPR, pages 5253–5263, 2020.
  • [24] Muhammed Kocabas, Chun-Hao P. Huang, Otmar Hilliges, and Michael J. Black. Pare: Part attention regressor for 3d human body estimation. In ICCV, pages 11127–11137, October 2021.
  • [25] Nikos Kolotouros, Georgios Pavlakos, Michael J Black, and Kostas Daniilidis. Learning to reconstruct 3d human pose and shape via model-fitting in the loop. In ICCV, pages 2252–2261, 2019.
  • [26] Nikos Kolotouros, Georgios Pavlakos, and Kostas Daniilidis. Convolutional mesh regression for single-image human shape reconstruction. In CVPR, pages 4501–4510, 2019.
  • [27] Nikos Kolotouros, Georgios Pavlakos, Dinesh Jayaraman, and Kostas Daniilidis. Probabilistic modeling for human mesh recovery. In ICCV, pages 11605–11614, October 2021.
  • [28] Christoph Lassner, Javier Romero, Martin Kiefel, Federica Bogo, Michael J Black, and Peter V Gehler. Unite the people: Closing the loop between 3d and 2d human representations. In CVPR, pages 6050–6059, 2017.
  • [29] Jiefeng Li, Chao Xu, Zhicun Chen, Siyuan Bian, Lixin Yang, and Cewu Lu. Hybrik: A hybrid analytical-neural inverse kinematics solution for 3d human pose and shape estimation. In CVPR, pages 3383–3393, 2021.
  • [30] Yong-Lu Li, Liang Xu, Xinpeng Liu, Xijie Huang, Yue Xu, Shiyi Wang, Hao-Shu Fang, Ze Ma, Mingyang Chen, and Cewu Lu. Pastanet: Toward human activity knowledge engine. In CVPR, pages 382–391, 2020.
  • [31] Zhihao Li, Jianzhuang Liu, Zhensong Zhang, Songcen Xu, and Youliang Yan. Cliff: Carrying location information in full frames into human pose and shape estimation. In ECCV, 2022.
  • [32] Kevin Lin, Lijuan Wang, and Zicheng Liu. End-to-end human pose and mesh reconstruction with transformers. In CVPR, pages 1954–1963, 2021.
  • [33] Kevin Lin, Lijuan Wang, and Zicheng Liu. Mesh graphormer. In ICCV, pages 12939–12948, 2021.
  • [34] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755, 2014.
  • [35] Matthew Loper, Naureen Mahmood, and Michael J Black. Mosh: Motion and shape capture from sparse markers. TOG, 33(6):1–13, 2014.
  • [36] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multi-person linear model. TOG, 34(6):1–16, 2015.
  • [37] Tianyu Luan, Yali Wang, Junhao Zhang, Zhe Wang, Zhipeng Zhou, and Yu Qiao. Pc-hmr: Pose calibration for 3d human mesh recovery from 2d images/videos. In AAAI, pages 2269–2276, 2021.
  • [38] Dushyant Mehta, Helge Rhodin, Dan Casas, Pascal Fua, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. Monocular 3d human pose estimation in the wild using improved cnn supervision. In 3DV, pages 506–516, 2017.
  • [39] Gyeongsik Moon and Kyoung Mu Lee. I2l-meshnet: Image-to-lixel prediction network for accurate 3d human pose and mesh estimation from a single rgb image. In ECCV, pages 752–768, 2020.
  • [40] Mohamed Omran, Christoph Lassner, Gerard Pons-Moll, Peter Gehler, and Bernt Schiele. Neural body fitting: Unifying deep learning and model based human pose and shape estimation. In 3DV, pages 484–494. IEEE, 2018.
  • [41] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed AA Osman, Dimitrios Tzionas, and Michael J Black. Expressive body capture: 3d hands, face, and body from a single image. In CVPR, pages 10975–10985, 2019.
  • [42] Georgios Pavlakos, Luyang Zhu, Xiaowei Zhou, and Kostas Daniilidis. Learning to estimate 3d human pose and shape from a single color image. In CVPR, pages 459–468, 2018.
  • [43] Liliana Lo Presti and Marco La Cascia. 3d skeleton-based human action classification: A survey. Pattern Recognition, 53:130–147, 2016.
  • [44] Haibo Qiu, Chunyu Wang, Jingdong Wang, Naiyan Wang, and Wenjun Zeng. Cross view fusion for 3d human pose estimation. In ICCV, pages 4342–4351, 2019.
  • [45] Jiajun Su, Chunyu Wang, Xiaoxuan Ma, Wenjun Zeng, and Yizhou Wang. Virtualpose: Learning generalizable 3d human pose models from virtual data. In ECCV, pages 55–71. Springer, 2022.
  • [46] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In CVPR, pages 5693–5703, 2019.
  • [47] Xiao Sun, Bin Xiao, Fangyin Wei, Shuang Liang, and Yichen Wei. Integral human pose regression. In ECCV, pages 529–545, 2018.
  • [48] Yu Sun, Qian Bao, Wu Liu, Yili Fu, Michael J Black, and Tao Mei. Monocular, one-stage, regression of multiple 3d people. In ICCV, pages 11179–11188, 2021.
  • [49] Yu Sun, Yun Ye, Wu Liu, Wenpeng Gao, Yili Fu, and Tao Mei. Human mesh recovery from monocular images via a skeleton-disentangled representation. In ICCV, pages 5349–5358, 2019.
  • [50] Hanyue Tu, Chunyu Wang, and Wenjun Zeng. Voxelpose: Towards multi-camera 3d human pose estimation in wild environment. In ECCV, pages 197–212. Springer, 2020.
  • [51] Hsiao-Yu Tung, Hsiao-Wei Tung, Ersin Yumer, and Katerina Fragkiadaki. Self-supervised learning of motion capture. In NIPS, volume 30, 2017.
  • [52] Gul Varol, Duygu Ceylan, Bryan Russell, Jimei Yang, Ersin Yumer, Ivan Laptev, and Cordelia Schmid. Bodynet: Volumetric inference of 3d human body shapes. In ECCV, pages 20–36, 2018.
  • [53] Gul Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J Black, Ivan Laptev, and Cordelia Schmid. Learning from synthetic humans. In CVPR, pages 109–117, 2017.
  • [54] Timo von Marcard, Roberto Henschel, Michael J Black, Bodo Rosenhahn, and Gerard Pons-Moll. Recovering accurate 3d human pose in the wild using imus and a moving camera. In ECCV, pages 601–617, 2018.
  • [55] Ziniu Wan, Zhengjia Li, Maoqing Tian, Jianbo Liu, Shuai Yi, and Hongsheng Li. Encoder-decoder with multi-level attention for 3d human shape and pose estimation. In ICCV, pages 13033–13042, 2021.
  • [56] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. In ECCV, pages 52–67, 2018.
  • [57] Yuanlu Xu, Song-Chun Zhu, and Tony Tung. Denserac: Joint 3d pose and shape estimation by dense render-and-compare. In ICCV, pages 7760–7770, 2019.
  • [58] Chun-Han Yao, Jimei Yang, Duygu Ceylan, Yi Zhou, Yang Zhou, and Ming-Hsuan Yang. Learning visibility for robust dense human body estimation. In ECCV, 2022.
  • [59] Hang Ye, Wentao Zhu, Chunyu Wang, Rujie Wu, and Yizhou Wang. Faster voxelpose: Real-time 3d human pose estimation by orthographic projection. In ECCV, pages 142–159. Springer, 2022.
  • [60] Andrei Zanfir, Elisabeta Marinoiu, and Cristian Sminchisescu. Monocular 3d pose and shape estimation of multiple people in natural scenes-the importance of multiple scene constraints. In CVPR, pages 2148–2157, 2018.
  • [61] Mihai Zanfir, Andrei Zanfir, Eduard Gabriel Bazavan, William T Freeman, Rahul Sukthankar, and Cristian Sminchisescu. Thundr: Transformer-based 3d human reconstruction with markers. In ICCV, pages 12971–12980, 2021.
  • [62] Wang Zeng, Wanli Ouyang, Ping Luo, Wentao Liu, and Xiaogang Wang. 3d human mesh regression with dense correspondence. In CVPR, pages 7054–7063, 2020.
  • [63] Hongwen Zhang, Jie Cao, Guo Lu, Wanli Ouyang, and Zhenan Sun. Learning 3d human shape and pose from dense body parts. IEEE TPAMI, 44(5):2610–2627, 2022.
  • [64] Hongwen Zhang, Yating Tian, Xinchi Zhou, Wanli Ouyang, Yebin Liu, Limin Wang, and Zhenan Sun. Pymaf: 3d human pose and shape regression with pyramidal mesh alignment feedback loop. In ICCV, pages 11446–11456, 2021.
  • [65] Yifu Zhang, Chunyu Wang, Xinggang Wang, Wenyu Liu, and Wenjun Zeng. Voxeltrack: Multi-person 3d human pose estimation and tracking in the wild. IEEE TPAMI, 45(2):2613–2626, 2022.

Appendix

We elaborate on the post-processing implementation of the virtual markers and provide additional experimental details and results. Finally, we discuss data from human subjects and the potential societal impact.

A.  Post-processing on Virtual Markers

As described in Section 3.1, considering the left-right symmetric structure of the human body, we slightly adjust the learned virtual markers $\mathbf{Z}$ to be symmetric. In fact, after the first step, which replaces each $\mathbf{z}_{i}$ by its nearest vertex to get $\widetilde{\mathbf{Z}} \in \mathbb{R}^{3 \times K}$, the markers in $\widetilde{\mathbf{Z}}$ are almost symmetric with few exceptions. To get the final symmetric virtual markers $\widetilde{\mathbf{Z}}^{sym} \in \mathbb{R}^{3 \times K}$, for each virtual marker located in the left body part, we take its symmetric vertex in the right body to be its symmetric counterpart.

Since the human mesh (i.e. SMPL [36]) itself is not strictly symmetric, we define the symmetric vertex pair (e.g. left elbow and right elbow) on a human mesh template $\mathbf{X}^t \in \mathbb{R}^{3\times M}$, as illustrated in Figure 9. We place $\mathbf{X}^t$ at the origin of the 3D coordinate system. Formally, we define the cost of matching the $i^{th}$ vertex to the $j^{th}$ vertex to be $\bm{C}_{i,j} = |x_i + x_j| + |y_i - y_j| + |z_i - z_j|$, so the cost is zero when the two vertices are exact mirror images across the plane $x = 0$. The symmetric counterpart of the $i^{th}$ vertex is the vertex $j$ with the minimal cost $\bm{C}_{i,j}$, which gives the symmetric vertex pair $(\mathbf{X}^t_i, \mathbf{X}^t_j)$.
In this way, for each virtual marker in the left body, we take its symmetric vertex counterpart to be its symmetric virtual marker and finally get $\widetilde{\mathbf{Z}}^{sym}$.
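The matching rule above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy five-vertex template, and the convention that "left" means $x > 0$ are all hypothetical choices made for the example.

```python
from typing import Dict, Sequence, Tuple

Vertex = Tuple[float, float, float]


def matching_cost(vi: Vertex, vj: Vertex) -> float:
    """C_ij = |x_i + x_j| + |y_i - y_j| + |z_i - z_j|."""
    return abs(vi[0] + vj[0]) + abs(vi[1] - vj[1]) + abs(vi[2] - vj[2])


def symmetric_counterpart(i: int, template: Sequence[Vertex]) -> int:
    """Index j of the template vertex with minimal matching cost to vertex i."""
    return min(range(len(template)),
               key=lambda j: matching_cost(template[i], template[j]))


def mirror_left_markers(marker_ids: Sequence[int],
                        template: Sequence[Vertex]) -> Dict[int, int]:
    """Map each left-side marker to its mirrored vertex index.

    Assumption for this sketch only: a vertex is on the left side if x > 0.
    """
    return {i: symmetric_counterpart(i, template)
            for i in marker_ids if template[i][0] > 0}


# Toy 5-vertex "template" centered at the origin (illustrative values only).
# It is deliberately not perfectly symmetric, like a real body mesh.
template = [
    (0.5, 1.0, 0.0), (-0.49, 1.01, 0.0),
    (0.2, 0.0, 0.1), (-0.2, 0.0, 0.12),
    (0.0, 1.5, 0.0),
]
pairs = mirror_left_markers([0, 2, 4], template)
print(pairs)  # {0: 1, 2: 3}
```

Vertex 4 lies on the midline, so it is skipped by the left-side filter; its minimal-cost match is itself. A brute-force search over all $M$ vertices is used here for clarity.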

Figure 9: Illustration of the human mesh template $\mathbf{X}^t$ in the 3D coordinate system and a symmetric vertex pair $(\mathbf{X}^t_i, \mathbf{X}^t_j)$.

B.  Experiments

In this section, we first add detailed descriptions for datasets and then provide more experimental results of our approach.

B.1  Datasets

H3.6M [16].

Following previous works [19, 25, 26, 42], we use the SMPL parameters generated from MoSh [35], which are fitted to the 3D physical marker locations, to get the GT 3D mesh supervision. Following standard practice [19], we evaluate the quality of the 3D pose of 14 joints derived from the estimated mesh, i.e. $\hat{\mathbf{M}}\mathcal{J}$. We report Mean Per Joint Position Error (MPJPE) and PA-MPJPE in millimeters (mm). The latter uses the Procrustes algorithm [13] to align the estimates to the GT poses before computing MPJPE. To evaluate mesh estimation results, we also report Mean Per Vertex Error (MPVE), which can be interpreted as MPJPE computed over the whole mesh.
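As a rough illustration of how these metrics relate, the sketch below computes MPVE and MPJPE with plain Python. It is a simplified example under assumptions: the helper names, the toy two-vertex mesh, and the one-row joint regressor are hypothetical, and the GT joints are regressed from the GT mesh here. PA-MPJPE would apply a Procrustes (similarity) alignment before the same distance computation; that step is omitted.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def mean_point_error(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Mean Euclidean distance between corresponding 3D points.

    The result is in the same unit as the inputs (mm in the paper).
    """
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


def regress_joints(vertices: Sequence[Point],
                   regressor: Sequence[Sequence[float]]) -> List[List[float]]:
    """Each joint is a weighted sum of mesh vertices.

    `regressor` holds one row of per-vertex weights for each joint.
    """
    return [[sum(w * v[axis] for w, v in zip(row, vertices))
             for axis in range(3)]
            for row in regressor]


# Toy data: a 2-vertex "mesh" and a single joint at the vertex midpoint.
pred_mesh = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt_mesh = [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)]
regressor = [[0.5, 0.5]]

# MPVE: error over all mesh vertices.
mpve = mean_point_error(pred_mesh, gt_mesh)
# MPJPE: the same error, computed on joints derived from the meshes.
mpjpe = mean_point_error(regress_joints(pred_mesh, regressor),
                         regress_joints(gt_mesh, regressor))
print(mpve, mpjpe)
```

Both metrics share one distance routine, which mirrors the description of MPVE as MPJPE computed over the whole mesh.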

3DPW [54].

The 3D GT SMPL parameters are obtained from the IMU data recorded during collection. Following previous works [32, 33, 24, 61], we use the train set of 3DPW to learn the model and evaluate on the test set.

MPI-INF-3DHP [38]

is a 3D pose dataset with 3D GT pose annotations. Since this dataset does not provide 3D mesh annotations, following [19, 25], we only enforce supervision on the 3D skeletons (Eq. (9)) in mesh losses.

UP-3D [28]

is a wild 2D pose dataset with natural images. The 3D poses and meshes are obtained by SMPLify [2]. Due to the lack of GT 3D poses, the fitted meshes are not accurate. Therefore we only use the 2D annotations to train the 3D virtual marker estimation network as in [47].

COCO [34]

is a large wild 2D pose dataset with natural images. Previous work [39] used SMPLify-X [41] to obtain pseudo SMPL mesh annotations, but they are not accurate. However, we find that if we project the 3D mesh onto the 2D image, the resulting 2D mesh vertices align well with the image. We therefore leverage the 2D annotations to train the virtual marker estimation network as in [47].

SURREAL [53]

is a large-scale synthetic dataset containing 6 million frames of synthetic humans. The images are photo-realistic renderings of people under large variations in shape, texture, viewpoint, and body pose. To ensure realism, the synthetic bodies are created using the SMPL body model, whose parameters are fit by MoSh [35] given raw 3D physical marker data. All the images have a resolution of $320\times 240$. We use the same training split to train the model and evaluate on the test split following [7].

B.2  Implementation Details and Computation Resource

Following common practice [19, 7, 39, 61, 26, 23, 33, 32], we conduct mix-training by using the above 2D and 3D datasets for experiments on the H3.6M and 3DPW datasets. To leverage the 3D pose estimation dataset, i.e. MPI-INF-3DHP [38], we extend the 64 virtual markers with the 17 landmark joints (i.e. skeleton) from the MPI-INF-3DHP dataset. For experiments on the SURREAL dataset, we use its training set alone as in [7, 37]. We implement the proposed method with PyTorch. All the experiments are conducted on a Linux machine with 4 NVIDIA 16GB V100 GPUs. The whole network is trained for 40 epochs with batch size 32 using the Adam [22] optimizer.

We evaluate the model complexity in terms of FLOPs (G) and the number of model parameters in Table 7. Compared to the most recent state-of-the-art methods that directly regress all mesh vertices, such as I2L-MeshNet [39], METRO [32], and Mesh Graphormer [33], our approach with the virtual marker representation requires fewer FLOPs (22.1 G versus 28.7 G to 153.0 G) and fewer parameters (109.6 M versus 141.2 M to 397.5 M) while getting better estimation quality. The last column shows the MPVE errors on the 3DPW test set for performance reference.

Methods                         FLOPs (G) ↓    Params (M)    MPVE ↓
I2L-MeshNet [39] ECCV’20        28.7           141.2         110.1
METRO [32] CVPR’21              153.0          397.5         88.2
Mesh Graphormer [33] ICCV’21    48.8           180.6         87.7
Ours                            22.1           109.6         77.9
Table 7: Computation overhead comparison with the recent state-of-the-art methods that directly regress all 3D vertices. The rightmost column shows the MPVE errors on the 3DPW test set for performance reference.
Loss terms    Ours    w/o $\mathcal{L}_{conf}$    w/o $\mathcal{L}_{pose}$    w/o $\mathcal{L}_{normal}$    w/o $\mathcal{L}_{edge}$
MPVE ↓        58.0    59.2                        58.3                        60.6                          60.4
Table 8: MPVE errors on H3.6M [16] test set when ablating different loss terms.
Occ. VM Parts    MPVE ↓        MPJPE ↓       PA-MPJPE ↓
None (Ours)      77.9          67.5          41.3
2 Arms           79.2 ↑ 1.3    68.2 ↑ 0.7    42.2 ↑ 0.9
2 Legs           78.3 ↑ 0.4    67.9 ↑ 0.4    41.7 ↑ 0.4
Body             78.6 ↑ 0.7    68.0 ↑ 0.5    41.8 ↑ 0.5
Random           78.7 ↑ 0.8    68.1 ↑ 0.6    41.9 ↑ 0.6
Table 9: Results on 3DPW [54] test set when different parts of virtual markers (VM) are occluded.

B.3  Additional Quantitative Results

Figure 10: Meshes estimated by our approach on Internet images with challenging cases (complex poses or extreme body shapes).
Different loss terms.

Table 8 reports the MPVE error on the H3.6M [16] test set when ablating different loss terms. The confidence loss [17] encourages each heatmap to have its maximum response at the GT position, which makes the heatmaps more interpretable. Without the confidence loss, the error increases slightly, from 58.0 to 59.2. Ablating either of the surface losses ($\mathcal{L}_{normal}$ or $\mathcal{L}_{edge}$) increases MPVE more, to 60.6 and 60.4 respectively, which demonstrates the smoothing effect of these two terms.

Robustness to occlusion.

We report results when different virtual markers are occluded by a synthetic mask in Table 9. The errors are only slightly larger than on the original images (None), with MPVE increasing by at most 1.3 mm, which validates the locality of the virtual marker representation. Occluding the arm regions results in the largest error increase. This may be because the arms have larger variations in the dataset.

Comparison to fitting.

To disentangle the ability to regress meshes from markers using $\hat{\mathbf{A}}$ from the ability to detect the virtual markers accurately in images, we first compute the estimation errors of the virtual markers. The MPJPE over all the virtual markers is 35.5 mm, which shows that these virtual markers can be accurately detected from the images. We then fit the mesh model parameters to these virtual markers. Table 10 shows the metrics of the fitted mesh on the SURREAL [53] test set. The fitted mesh has errors similar to those of our regressed mesh, which uses the interpolation matrix $\hat{\mathbf{A}}$ (e.g. 44.6 mm versus 44.7 mm MPVE). This validates the accuracy of the estimated virtual markers.

B.4  Additional Qualitative Results

Figure 12 shows more qualitative comparisons with Pose2Mesh [7] on the SURREAL test set, which has diverse body shapes. The skeleton representation used in Pose2Mesh loses the body shape information, so the method [7] can only recover mean shapes. For example, in Figure 12 (d) and (e), the meshes estimated by Pose2Mesh tend to have the average body shape and fail to capture the real body shape, regardless of whether the person is slim or stout. This is caused by the bottleneck of the limited skeleton representation, which leads the model to implicitly learn a mean shape for the whole training dataset. In contrast, our approach with the virtual marker representation generates more accurate mesh estimation results.

     Method      MPVE\downarrow      MPJPE\downarrow      PA-MPJPE\downarrow
     Fitting      44.6      34.8      29.5
     Ours      44.7      36.9      28.9
Table 10: Results on SURREAL [53] test set when the mesh is obtained by fitting to the estimated virtual markers.

Figure 13 shows more qualitative comparisons with Pose2Mesh [7] and METRO [32] on the 3DPW test set. Pose2Mesh and METRO use the skeleton and all 3D vertices as intermediate representations, respectively. The estimated meshes are overlaid on the images according to the camera parameters. Pose2Mesh [7] has difficulty in estimating correct body poses and shapes when truncation occurs (a) or in complex postures (c). The results of METRO [32] have many artifacts where the estimated mesh is not smooth, and they also fail to align well with the image. In contrast, our method estimates more accurate human poses and shapes and produces smooth human meshes. In addition, it is more robust to truncation and occlusion and aligns better with the image.

Figure 11: Typical failure cases. (a) The right arm has inaccurate shape estimation due to the inaccurate virtual marker estimation around the arm when occluded. (b) Our method treats the lower arm of another person as its own due to occlusion. (c) Interpenetration around the right hand.

Figure 14 shows more qualitative results of our approach on the 3DPW [54], H3.6M [16], MPI-INF-3DHP [38], and COCO [34] datasets. Figure 10 shows more qualitative results on Internet images with challenging cases, such as extreme body shapes or complex poses. Our method generalizes well to natural scenes. Figure 11 shows typical failure cases, including inaccurate shape estimation and interpenetration, which are mainly caused by inaccurate 3D virtual marker estimation when occlusion occurs. But as expected, the remaining body parts are barely affected due to the local and sparse property of the virtual markers.

Figure 12: Qualitative comparison between our method and Pose2Mesh [7] on SURREAL test set [53]. Our approach generates more accurate mesh estimation results on images of diverse body shapes.
Figure 13: Qualitative comparison between our method and Pose2Mesh [7], METRO [32] on 3DPW test set [54]. Our approach is more robust to occlusion and truncation and generates more accurate mesh estimation results that align images well.
Figure 14: Meshes estimated by our approach on images from the 3DPW [54] dataset (row 1-4), H3.6M [16] dataset (row 5), MPI-INF-3DHP [38] dataset (row 6), and COCO dataset (last 2 rows) [34].

C.  Human Subject Data

We use existing public datasets of human subjects in our experiments following their official licensing requirements. With proper usage, the proposed method could be beneficial to society (e.g. elderly fall detection).