3D Human Mesh Estimation from Virtual Markers

Xiaoxuan Ma¹ Jiajun Su¹ Chunyu Wang ³ Wentao Zhu¹ Yizhou Wang^{1, 2, 4}
¹School of Computer Science, Center on Frontiers of Computing Studies, Peking University
²Inst. for Artificial Intelligence, Peking University
³Microsoft Research Asia
⁴Nat’l Eng. Research Center of Visual Technology
{maxiaoxuan, sujiajun, wtzhu, yizhou.wang}@pku.edu.cn, [email protected]
Corresponding author

Abstract

Inspired by the success of volumetric 3D pose estimation, some recent human mesh estimators propose to estimate 3D skeletons as intermediate representations, from which, the dense 3D meshes are regressed by exploiting the mesh topology. However, body shape information is lost in extracting skeletons, leading to mediocre performance. The advanced motion capture systems solve the problem by placing dense physical markers on the body surface, which allows to extract realistic meshes from their non-rigid motions. However, they cannot be applied to wild images without markers. In this work, we present an intermediate representation, named virtual markers, which learns 64 landmark keypoints on the body surface based on the large-scale mocap data in a generative style, mimicking the effects of physical markers. The virtual markers can be accurately detected from wild images and can reconstruct the intact meshes with realistic shapes by simple interpolation. Our approach outperforms the state-of-the-art methods on three datasets. In particular, it surpasses the existing methods by a notable margin on the SURREAL dataset, which has diverse body shapes. Code is available at https://github.com/ShirleyMaxx/VirtualMarker.

1 Introduction

Refer to caption — Figure 1: Mesh estimation results on four examples with different body shapes. Pose2Mesh [7] which uses 3D skeletons as the intermediate representation fails to predict accurate shapes. Our virtual marker-based method obtains accurate estimates.

3D human mesh estimation aims to estimate the 3D positions of the mesh vertices that are on the body surface. The task has attracted a lot of attention from the computer vision and computer graphics communities [3, 43, 30, 35, 51, 19, 25, 37, 27, 10] because it can benefit many applications such as virtual reality [15]. Recently, the deep learning-based methods [19, 7, 29] have significantly advanced the accuracy on the benchmark datasets.

The pioneer methods [51, 19] propose to regress the pose and shape parameters of the mesh models such as SMPL [36] directly from images. While straightforward, their accuracy is usually lower than the state-of-the-arts. The first reason is that the map** from the image features to the model parameters is highly non-linear and suffers from image-model misalignment [29]. Besides, existing mesh datasets [16, 54, 38, 28] are small and limited to simple laboratory environments due to the complex capturing process. The lack of sufficient training data severely limits its performance.

Recently, some works [26, 39] begin to formulate mesh estimation as a dense 3D keypoint detection task inspired by the success of volumetric pose estimation [47, 50, 65, 44, 59, 45]. For example, in [26, 39], the authors propose to regress the 3D positions of all vertices. However, it is computationally expensive because it has more than several thousand vertices. Moon and Lee [39] improve the efficiency by decomposing the 3D heatmaps into multiple 1D heatmaps at the cost of mediocre accuracy. Choi et al. [7] propose to first detect a sparser set of skeleton joints in the images, from which the dense 3D meshes are regressed by exploiting the mesh topology. The methods along this direction have attracted increasing attention [7, 29, 55] due to two reasons. First, the proxy task of 3D skeleton estimation can leverage the abundant 2D pose datasets which notably improves the accuracy. Second, mesh regression from the skeletons is efficient. However, important information about the body shapes is lost in extracting the 3D skeletons, which is largely overlooked previously. As a result, different types of body shapes, such as lean or obese, cannot be accurately estimated (see Figure 1).

The professional marker-based motion capture (mocap) method MoSh [35] places physical markers on the body surface and explore their subtle non-rigid motions to extract meshes with accurate shapes. However, the physical markers limit the approach to be used in laboratory environments. We are inspired to think whether we can identify a set of landmarks on the mesh as virtual markers, e.g., elbow and wrist, that can be detected from wild images, and allow to recover accurate body shapes? The desired virtual markers should satisfy several requirements. First, the number of markers should be much smaller than that of the mesh vertices so that we can use volumetric representations to efficiently estimate their 3D positions. Second, the markers should capture the mesh topology so that the intact mesh can be accurately regressed from them. Third, the virtual markers have distinguishable visual patterns so that they can be detected from images.

In this work, we present a learning algorithm based on archetypal analysis [12] to identify a subset of mesh vertices as the virtual markers that try to satisfy the above requirements to the best extent. Figure 2 shows that the learned virtual markers coarsely outline the body shape and pose which paves the way for estimating meshes with accurate shapes. Then we present a simple framework for 3D mesh estimation on top of the representation as shown in Figure 3. It first learns a 3D keypoint estimation network based on [47] to detect the 3D positions of the virtual markers. Then we recover the intact mesh simply by interpolating them. The interpolation weights are pre-trained in the representation learning step and will be adjusted by a light network based on the prediction confidences of the virtual markers for each image.

We extensively evaluate our approach on three benchmark datasets. It consistently outperforms the state-of-the-art methods on all of them. In particular, it achieves a significant gain on the SURREAL dataset [53] which has a variety of body shapes. Our ablation study also validates the advantages of the virtual marker representation in terms of recovering accurate shapes. Finally, the method shows decent generalization ability and generates visually appealing results for the wild images.

2 Related work

2.1 Optimization-based mesh estimation

Before deep learning dominates this field, 3D human mesh estimation [35, 2, 28, 41, 60] is mainly optimization-based, which optimizes the parameters of the human mesh models to match the observations. For example, Loper et al. [35] propose MoSh that optimizes the SMPL parameters to align the mesh with the 3D marker positions. It is usually used to get GT 3D meshes for benchmark datasets because of its high accuracy. Later works propose to optimize the model parameters or mesh vertices based on 2D image cues [2, 28, 41, 60, 11]. They extract intermediate representations such as 2D skeletons from the images and optimize the mesh model by minimizing the discrepancy between the model projection and the intermediate representations such as the 2D skeletons. These methods are usually sensitive to initialization and suffer from local optimum.

2.2 Learning-based mesh estimation

Recently, most works follow the learning-based framework and have achieved promising results. Deep networks [51, 19, 25, 37, 27] are used to regress the SMPL parameters from image features. However, learning the map** from the image space to the parameter space is highly non-linear [39]. In addition, they suffer from the misalignment between the meshes and image pixels [62]. These problems make it difficult to learn an accurate yet generalizable model.

Some works propose to introduce proxy tasks to get intermediate representations first, ho** to alleviate the learning difficulty. In particular, intermediate representations of physical markers [61], IUV images [57, 62, 64, 63], body part segmentation masks [52, 24, 28, 40] and body skeletons [49, 7, 29, 55] have been proposed. In particular, THUNDR [61] first estimates the 3D locations of physical markers from images and then reconstructs the mesh from the 3D markers. The physical markers can be interpreted as a simplified representation of body shape and pose. Although it is very accurate, it cannot be applied to wild images without markers. In contrast, body skeleton is a popular human representation that can be robustly detected from wild images. Choi et al. [7] propose to first estimate the 3D skeletons, and then estimate the intact mesh from them. However, accurate body shapes are difficult to be recovered from the oversimplified 3D skeletons.

Our work belongs to the learning-based class and is related to works that use physical markers or skeletons as intermediate representations. But different from them, we propose a novel intermediate representation, named virtual markers, which is more expressive to reduce the ambiguity in pose and shape estimation than body skeletons and can be applied to wild images.

3 Method

In this section, we describe the details of our approach. First, Section 3.1 introduces how we learn the virtual marker representation from mocap data. Then we present the overall framework for mesh estimation from an image in Section 3.2. At last, Section 3.3 discusses the loss functions and training details.

3.1 The virtual marker representation

We represent a mesh by a vector of vertex positions $\mathbf{x}\in\mathbb{R}^{3M}$ where $M$ is the number of mesh vertices. Denote a mocap dataset such as [16] with $N$ meshes as $\overset{\frown}{\mathbf{X}}=[\mathbf{x}_{1},\,...,\,\mathbf{x}_{N}]\in\mathbb% {R}^{3M\times N}$ . To unveil the latent structure among vertices, we reshape it to $\mathbf{X}\in\mathbb{R}^{3N\times M}$ with each column $\mathbf{x}_{i}\in\mathbb{R}^{3N}$ representing all possible positions of the $i^{\text{th}}$ vertex in the dataset [16].

The rank of $\mathbf{X}$ is smaller than $M$ because the mesh representation is smooth and redundant where some vertices can be accurately reconstructed by the others. While it seems natural to apply PCA [18] to $\mathbf{X}$ to compute the eigenvectors as virtual markers for reconstructing others, there is no guarantee that the virtual markers correspond to the mesh vertices, making them difficult to be detected from images. Instead, we aim to learn $K$ virtual markers $\mathbf{Z}=[\mathbf{z}_{1},...,\mathbf{z}_{K}]\in\mathbb{R}^{3N\times K}$ that try to satisfy the following two requirements to the greatest extent. First, they can accurately reconstruct the intact mesh $\mathbf{X}$ by their linear combinations: $\mathbf{X}=\mathbf{Z}\mathbf{A}$ , where $\mathbf{A}\in\mathbb{R}^{K\times M}$ is a coefficient matrix that encodes the spatial relationship between the virtual markers and the mesh vertices. Second, they should have distinguishable visual patterns in images so that they can be easily detected from images. Ideally, they can be on the body surface as the meshes.

We apply archetypal analysis [12, 4] to learn $\mathbf{Z}$ by minimizing a reconstruction error with two additional constraints: (1) each vertex $\mathbf{x}_{i}$ can be reconstructed by convex combinations of $\mathbf{Z}$ , and (2) each marker $\mathbf{z}_{i}$ should be convex combinations of the mesh vertices $\mathbf{X}$ :

\min_{\begin{subarray}{c}\bm{\alpha}_{i}\in{\Delta}_{K}\,for\,1\leq i\leq M,\\ \bm{\beta}_{j}\in{\Delta}_{M}\,for\,1\leq j\leq K\end{subarray}}||\mathbf{X}-% \mathbf{X}\mathbf{B}\mathbf{A}||^{2}_{F},\\

(1)

where $\mathbf{A}=[\bm{\alpha}_{1},...,\bm{\alpha}_{M}]\in\mathbb{R}^{K\times M}$ , each $\bm{\alpha}$ resides in the simplex ${\Delta}_{K}\triangleq\{\bm{\alpha}\in\mathbb{R}^{K}\,\mathrm{s.t.}\,\bm{% \alpha}\succeq 0\,\text{and}\,{||\bm{\alpha}||}_{1}=1\}$ , and $\mathbf{B}=[\bm{\beta}_{1},...,\bm{\beta}_{K}]\in\mathbb{R}^{M\times K}$ , $\bm{\beta}_{j}\in{\Delta}_{M}$ . We adopt Active-set algorithm [4] to solve objective (1) and obtain the learned virtual markers $\mathbf{Z}=\mathbf{X}\mathbf{B}\in\mathbb{R}^{3N\times K}$ . As shown in [12, 4], the two constraints encourage the virtual markers $\mathbf{Z}$ to unveil the latent structure among vertices, therefore they learn to be close to the extreme points of the mesh and located on the body surface as much as possible.

Type	Formula	Reconst. Error (mm) $\downarrow$
Original	$\|\|\mathbf{X}-\mathbf{X}\mathbf{B}\mathbf{A}\|\|^{2}_{F}$	11.67
Symmetric	$\|\|\mathbf{X}-\mathbf{X}\widetilde{\mathbf{B}}^{sym}\widetilde{\mathbf{A}}^{sym% }\|\|^{2}_{F}$	10.98

Table 1: The reconstruction errors using the original and the symmetric sets of markers on the H3.6M dataset [16], respectively. The errors are small indicating that they are sufficiently expressive and can reconstruct all vertices accurately.

Post-processing. Since human body is left-right symmetric, we adjust $\mathbf{Z}$ to reflect the property. We first replace each $\mathbf{z}_{i}\in\mathbf{Z}$ by its nearest vertex on the mesh and obtain $\widetilde{\mathbf{Z}}\in\mathbb{R}^{3\times K}$ . This step allows us to compute the left or right counterpart of each marker. Then we replace the markers in the right body with the symmetric vertices in the left body and obtain the symmetric markers $\widetilde{\mathbf{Z}}^{sym}\in\mathbb{R}^{3\times K}$ . Finally we update $\mathbf{B}$ and $\mathbf{A}$ by minimizing $||\mathbf{X}-\mathbf{X}\widetilde{\mathbf{B}}^{sym}\widetilde{\mathbf{A}}^{sym% }||^{2}_{F}$ subject to $\widetilde{\mathbf{Z}}^{sym}=\mathbf{X}\widetilde{\mathbf{B}}^{sym}$ . More details are elaborated in the supplementary.

Figure 2 shows the virtual markers learned on the mocap dataset [16] after post-processing. They are similar to the physical markers and approximately outline the body shape which agrees with our expectations. They are roughly evenly distributed on the surface of the body, and some of them are located close to the body keypoints, which have distinguishable visual patterns to be accurately detected. Table 1 shows the reconstruction errors of using original markers $\mathbf{X}\mathbf{B}$ and the symmetric markers $\mathbf{X}\widetilde{\mathbf{B}}^{sym}$ . Both can reconstruct meshes accurately.

3.2 Mesh estimation framework

On top of the virtual markers, we present a simple yet effective framework for end-to-end 3D human mesh estimation from a single image. As shown in Figure 3, it consists of two branches. The first branch uses a volumetric CNN [47] to estimate the 3D positions $\hat{\mathbf{P}}$ of the markers, and the second branch reconstructs the full mesh $\hat{\mathbf{M}}$ by predicting a coefficient matrix $\hat{\mathbf{A}}$ :

\small\hat{\mathbf{M}}=\hat{\mathbf{P}}\hat{\mathbf{A}}.

(2)

We will describe the two branches in more detail.

3D marker estimation. We train a neural network to estimate a 3D heatmap $\hat{\mathbf{H}}=[\hat{\mathbf{H}}_{1},\,...,\,\hat{\mathbf{H}}_{K}]\in\mathbb% {R}^{K\times D\times H\times W}$ from an image. The heatmap encodes per-voxel likelihood of each marker. There are $D\times H\times W$ voxels in total which are used to discretize the 3D space. The 3D position $\hat{\mathbf{P}}_{z}\in\mathbb{R}^{3}$ of each marker is computed as the center of mass of the corresponding heatmap $\hat{\mathbf{H}}_{z}$ [47] as follows:

\small\hat{\mathbf{P}}_{z}=\sum_{d=1}^{D}\sum_{h=1}^{H}\sum_{w=1}^{W}(d,h,w)% \cdot\hat{\mathbf{H}}_{z}(d,h,w).

(3)

The positions of all markers are represented as $\hat{\mathbf{P}}=[\hat{\mathbf{P}}_{1},\hat{\mathbf{P}}_{2},\cdots,\hat{% \mathbf{P}}_{K}]$ .

Interpolation. Ideally, if we have accurate estimates for all virtual markers $\hat{\mathbf{P}}$ , then we can recover the complete mesh by simply multiplying $\hat{\mathbf{P}}$ with a fixed coefficient matrix $\widetilde{\mathbf{A}}^{sym}$ with sufficient accuracy as validated in Table 1. However, in practice, some markers may have large estimation errors because they may be occluded in the monocular setting. Note that this happens frequently. For example, the markers in the back will be occluded when a person is facing the camera. As a result, inaccurate markers positions may bring large errors to the final mesh if we directly multiply them with the fixed matrix $\widetilde{\mathbf{A}}^{sym}$ .

Our solution is to rely more on those accurately detected markers. To that end, we propose to update the coefficient matrix based on the estimation confidence scores of the markers. In practice, we simply take the heatmap score at the estimated positions of each marker, i.e. $\hat{\mathbf{H}}_{z}(\hat{\mathbf{P}}_{z})$ , and feed them to a single fully-connected layer to obtain the coefficient matrix $\hat{\mathbf{A}}$ . Then the mesh is reconstructed by $\hat{\mathbf{M}}=\hat{\mathbf{P}}\hat{\mathbf{A}}$ .

3.3 Training

We train the whole network end-to-end in a supervised way. The overall loss function is defined as:

\displaystyle\mathcal{L}=\lambda_{vm}\mathcal{L}_{vm}+\lambda_{c}\mathcal{L}_{% conf}+\lambda_{m}\mathcal{L}_{mesh}.

(4)

Virtual marker loss. We define $\mathcal{L}_{vm}$ as the $L_{1}$ distance between the predicted 3D virtual markers $\hat{\mathbf{P}}$ and the GT $\hat{\mathbf{P}}^{*}$ as follows:

\displaystyle\mathcal{L}_{vm}=\|\hat{\mathbf{P}}-\hat{\mathbf{P}}^{*}\|_{1}.

(5)

Note that it is easy to get GT markers $\hat{\mathbf{P}}^{*}$ from GT meshes as stated in Section 3.1 without additional manual annotations.

Confidence loss. We also require that the 3D heatmaps have reasonable shapes, therefore, the heatmap score at the voxel containing the GT marker position $\hat{\mathbf{P}}_{z}^{*}$ should have the maximum value as in the previous work [17]:

\displaystyle\mathcal{L}_{conf}=-\sum_{z=1}^{K}log(\hat{\mathbf{H}}_{z}(\hat{% \mathbf{P}}_{z}^{*})).

(6)

Mesh loss. Following [39], we define $\mathcal{L}_{mesh}$ as a weighted sum of four losses:

\displaystyle\small\mathcal{L}_{mesh}=\mathcal{L}_{vertex}+\mathcal{L}_{pose}+% \mathcal{L}_{normal}+\lambda_{e}\mathcal{L}_{edge}.

(7)

–

Vertex coordinate loss. We adopt $L_{1}$ loss between predicted 3D mesh coordinates $\hat{\mathbf{M}}$ with GT mesh $\hat{\mathbf{M}}^{*}$ as:

\displaystyle\mathcal{L}_{vertex}=\|\hat{\mathbf{M}}-\hat{\mathbf{M}}^{*}\|_{1}.

(8)

–

Pose loss. We use $L_{1}$ loss between the 3D landmark joints regressed from mesh $\hat{\mathbf{M}}\mathcal{J}$ and the GT joints $\hat{\mathbf{J}}^{*}$ as:

\displaystyle\mathcal{L}_{pose}=\|\hat{\mathbf{M}}\mathcal{J}-\hat{\mathbf{J}}% ^{*}\|_{1},

(9)

where $\mathcal{J}\in\mathbb{R}^{M\times J}$ is a pre-defined joint regression matrix in SMPL model [2].

–

Surface losses. To improve surface smoothness [56], we supervise the normal vector of a triangle face with GT normal vectors by $\mathcal{L}_{normal}$ and the edge length of the predicted mesh with GT length by $\mathcal{L}_{edge}$ :

		$\displaystyle\mathcal{L}_{normal}=\sum_{f}\sum_{\{i,j\}\subset f}\left\|\left<% \frac{\hat{\mathbf{M}}_{i}-\hat{\mathbf{M}}_{j}}{\\|\hat{\mathbf{M}}_{i}-\hat{% \mathbf{M}}_{j}\\|_{2}},\hat{\mathbf{n}}_{f}^{*}\right>\right\|,$		(10)
		$\displaystyle\mathcal{L}_{edge}=\sum_{f}\sum_{\{i,j\}\subset f}\left\|\\|\hat{% \mathbf{M}}_{i}-\hat{\mathbf{M}}_{j}\\|_{2}-\\|\hat{\mathbf{M}}_{i}^{}-\hat{% \mathbf{M}}_{j}^{}\\|_{2}\right\|.$		(10)

where $f$ and $\hat{\mathbf{n}}_{f}^{*}$ denote a triangle face in the mesh and its GT unit normal vector, respectively. $\hat{\mathbf{M}}_{i}$ denote the $i^{th}$ vertex of $\hat{\mathbf{M}}$ . ^∗ denotes GT.

4 Experiments

Method	Intermediate	H3.6M			3DPW
Method	Representation	MPVE $\downarrow$	MPJPE $\downarrow$	PA-MPJPE $\downarrow$	MPVE $\downarrow$	MPJPE $\downarrow$	PA-MPJPE $\downarrow$
^† Arnab et al. [1] CVPR’19	2D skeleton	-	77.8	54.3	-	-	72.2
^† HMMR [20] CVPR’19	-	-	-	56.9	139.3	116.5	72.6
^† DSD-SATN [49] ICCV’19	3D skeleton	-	59.1	42.4	-	-	69.5
^† VIBE [23] CVPR’20	-	-	65.9	41.5	99.1	82.9	51.9
^† TCMR [6] CVPR’21	-	-	62.3	41.1	102.9	86.5	52.7
^† MAED [55] ICCV’21	3D skeleton	-	56.3	38.7	92.6	79.1	45.7
SMPLify [2] ECCV’16	2D skeleton	-	-	82.3	-	-	-
HMR [19] CVPR’18	-	96.1	88.0	56.8	152.7	130.0	81.3
GraphCMR [26] CVPR’19	3D vertices	-	-	50.1	-	-	70.2
SPIN [25] ICCV’19	-	-	-	41.1	116.4	96.9	59.2
DenseRac [57] ICCV’19	IUV image	-	76.8	48.0	-	-	-
DecoMR [62] CVPR’20	IUV image	-	60.6	39.3	-	-	-
ExPose [9] ECCV’20	-	-	-	-	-	93.4	60.7
Pose2Mesh [7] ECCV’20	3D skeleton	85.3	64.9	46.3	106.3	88.9	58.3
I2L-MeshNet [39] ECCV’20	3D vertices	65.1	55.7	41.1	110.1	93.2	57.7
PC-HMR [37] AAAI’21	3D skeleton	-	-	-	108.6	87.8	66.9
HybrIK [29] CVPR’21	3D skeleton	65.7	54.4	34.5	86.5	74.1	45.0
METRO [32] CVPR’21	3D vertices	-	54.0	36.7	88.2	77.1	47.9
ROMP [48] ICCV’21	-	-	-	-	108.3	91.3	54.9
Mesh Graphormer[33] ICCV’21	3D vertices	-	51.2	34.5	87.7	74.7	45.6
PARE [24] ICCV’21	Segmentation	-	-	-	88.6	74.5	46.5
THUNDR [61] ICCV’21	3D markers	-	55.0	39.8	88.0	74.8	51.5
PyMaf [64] ICCV’21	IUV image	-	57.7	40.5	110.1	92.8	58.9
ProHMR [27] ICCV’21	-	-	-	41.2	-	-	59.8
OCHMR [21] CVPR’22	2D heatmap	-	-	-	107.1	89.7	58.3
3DCrowdNet [8] CVPR’22	3D skeleton	-	-	-	98.3	81.7	51.5
CLIFF [31] ECCV’22	-	-	47.1	32.7	81.2	69.0	43.0
FastMETRO [5] ECCV’22	3D vertices	-	52.2	33.7	84.1	73.5	44.6
VisDB [58] ECCV’22	3D vertices	-	51.0	34.5	85.5	73.5	44.9
Ours	Virtual marker	58.0	47.3	32.0	77.9	67.5	41.3

Table 2: Comparison to the state-of-the-arts on H3.6M [16] and 3DPW [54] datasets. ^† means using temporal cues. The methods are not strictly comparable because they may have different backbones and training datasets. We provide the numbers only to show proof-of-concept results.

4.1 Datasets and metrics

H3.6M [16]. We use (S1, S5, S6, S7, S8) for training and (S9, S11) for testing. As in [19, 7, 32, 33], we report MPJPE and PA-MPJPE for poses that are derived from the estimated meshes. We also report Mean Per Vertex Error (MPVE) for the whole mesh.

3DPW [54] is collected in natural scenes. Following the previous works [32, 33, 24, 61], we use the train set of 3DPW to learn the model and evaluate on the test set. The same evaluation metrics as H3.6M are used.

SURREAL [53] is a large-scale synthetic dataset with GT SMPL annotations and has diverse samples in terms of body shapes, backgrounds, etc. We use its training set to train a model and evaluate the test split following [7].

4.2 Implementation Details

We learn $64$ virtual markers on the H3.6M [16] training set. We use the same set of markers for all datasets instead of learning a separate set on each dataset. Following [19, 7, 39, 61, 26, 23, 33, 32], we conduct mix-training by using MPI-INF-3DHP [38], UP-3D [28], and COCO [34] training set for experiments on the H3.6M and 3DPW datasets. We adapt a 3D pose estimator [47] with HRNet-W48 [46] as the image feature backbone for estimating the 3D virtual markers. We set the number of voxels in each dimension to be $64$ , i.e. $D=H=W=64$ for 3D heatmaps. Following [19, 26, 39], we crop every single human region from the input image and resize it to $256\times 256$ . We use Adam [22] optimizer to train the whole framework for $40$ epochs with a batch size of $32$ . The learning rates for the two branches are set to $5\times 10^{-4}$ and $1\times 10^{-3}$ , respectively, which are decreased by half after the $30^{th}$ epoch. Please refer to the supplementary for more details.

Method	Intermediate	MPVE $\downarrow$	MPJPE $\downarrow$	PA-MPJPE $\downarrow$
Method	Representation	MPVE $\downarrow$	MPJPE $\downarrow$	PA-MPJPE $\downarrow$
HMR [19] CVPR’18	-	85.1	73.6	55.4
BodyNet [52] ECCV’18	Skel. + Seg.	65.8	-	-
GraphCMR [26] CVPR’19	3D vertices	103.2	87.4	63.2
SPIN [25] ICCV’19	-	82.3	66.7	43.7
DecoMR [62] CVPR’20	IUV image	68.9	52.0	43.0
Pose2Mesh [7] ECCV’20	3D skeleton	68.8	56.6	39.6
PC-HMR [37] AAAI’21	3D skeleton	59.8	51.7	37.9
^∗ DynaBOA [14] TPAMI’22	-	70.7	55.2	34.0
Ours	Virtual marker	44.7	36.9	28.9

Table 3: Comparison to the state-of-the-arts on SURREAL [53] dataset. ^∗ means training on the test split with 2D supervisions. “Skel. + Seg.” means using skeleton and segmentation together.

4.3 Comparison to the State-of-the-arts

Results on H3.6M. Table 2 compares our approach to the state-of-the-art methods on the H3.6M dataset. Our method achieves competitive or superior performance. In particular, it outperforms the methods that use skeletons (Pose2Mesh [7], DSD-SATN [49]), body markers (THUNDR) [61], or IUV image [62, 64] as proxy representations, demonstrating the effectiveness of the virtual marker representation.

Results on 3DPW. We compare our method to the state-of-the-art methods on the 3DPW dataset in Table 2. Our approach achieves state-of-the-art results among all the methods, validating the advantages of the virtual marker representation over the skeleton representation used in Pose2Mesh [7], DSD-SATN [49], and other representations like IUV image used in PyMAF [64]. In particular, our approach outperforms I2L-MeshNet [39], METRO [32], and Mesh Graphormer [33] by a notable margin, which suggests that virtual markers are more suitable and effective representations than detecting all vertices directly as most of them are not discriminative enough to be accurately detected.

Results on SURREAL. This dataset has more diverse samples in terms of body shapes. The results are shown in Table 3. Our approach outperforms the state-of-the-art methods by a notable margin, especially in terms of MPVE. Figure 1 shows some challenging cases without cherry-picking. The skeleton representation loses the body shape information so the method [7] can only recover mean shapes. In contrast, our approach generates much more accurate mesh estimation results.

No.	Intermediate	MPVE $\downarrow$
No.	Representation	H3.6M	SURREAL
(a)	Skeleton	64.4	53.6
(b)	Rand virtual marker	63.0	50.1
(c)	Virtual marker	58.0	44.7

Table 4: Ablation study of the virtual marker representation for our approach on H3.6M and SURREAL datasets. “Skeleton” means the sparse landmark joint representation is used. “Rand virtual marker” means the virtual markers are randomly selected from all the vertices without learning. (c) is our method, where the learned virtual markers are used.

4.4 Ablation study

Virtual marker representation. We compare our method to two baselines in Table 4. First, in baseline (a), we replace the virtual markers of our method with the skeleton representation. The rest are kept the same as ours (c). Our method achieves a much lower MPVE than the baseline (a), demonstrating that the virtual markers help to estimate body shapes more accurately than the skeletons. In baseline (b), we randomly sample $64$ from the $6890$ mesh vertices as virtual markers. We repeat the experiment five times and report the average number. We can see that the result is worse than ours, which is because the randomly selected vertices may not be expressive to reconstruct the other vertices or can not be accurately detected from images as they lack distinguishable visual patterns. The results validate the effectiveness of our learning strategy.

Figure 1 shows some qualitative results on the SURREAL test set. The meshes estimated by the baseline which uses skeleton representation, i.e. Pose2Mesh [7], have inaccurate body shapes. This is reasonable because the skeleton is oversimplified and has very limited capability to recover shapes. Instead, it implicitly learns a mean shape for the whole training dataset. In contrast, the mesh estimated by using virtual markers has much better quality due to its strong representation power and therefore can handle different body shapes elegantly. Figure 4 also shows some qualitative results on the H3.6M test set. For clarity, we draw the intermediate representation (blue balls) in it as well.

Number of virtual markers. We evaluate how the number of virtual markers affects estimation quality on H3.6M [16] dataset. Figure 5 visualizes the learned virtual markers, which are all located on the body surface and close to the extreme points of the mesh. This is expected as mentioned in Section 3.1. Table 5 (GT) shows the mesh reconstruction results when we have GT 3D positions of the virtual markers in objective (1). When we increase the number of virtual markers, both mesh reconstruction error (MPVE) and the regressed landmark joint error (MPJPE) steadily decrease. This is expected because using more virtual markers improves the representation power. However, using more virtual markers cannot guarantee smaller estimation errors when we need to estimate the virtual marker positions from images as in our method. This is because the additional virtual markers may have large estimation errors which affect the mesh estimation result. The results are shown in Table 5 (Det). Increasing the number of virtual markers $K$ steadily reduces the MPVE errors when $K$ is smaller than $96$ . However, if we keep increasing $K$ , the error begins to increase. This is mainly because some of the newly introduced virtual markers are difficult to detect from images and therefore bring errors to mesh estimation.

$K$	GT		Det
$K$	MPVE $\downarrow$	MPJPE $\downarrow$	MPVE $\downarrow$	MPJPE $\downarrow$
16	46.8	39.8	58.7	47.8
32	20.1	14.2	58.2	48.3
64	11.0	7.5	58.0	47.3
96	9.9	5.6	59.6	48.2

Table 5: Ablation study of the different number of virtual markers (

K

) on H3.6M [16] dataset. (GT) Mesh reconstruction results when GT 3D positions of the virtual markers are used in objective (1). (Det) Mesh estimation results obtained by our proposed framework when we use different numbers of virtual markers (

K

Coefficient matrix. We compare our method to a baseline which uses the fixed coefficient matrix $\widetilde{\mathbf{A}}^{sym}$ . We show the quality comparison in Figure 6. We can see that the estimated mesh by a fixed coefficient matrix (a) has mostly correct pose and shape but there are also some artifacts on the mesh while using the updated coefficient matrix (b) can get better mesh estimation results. As shown in Table 6, using a fixed coefficient matrix gets larger MPVE and MPJPE errors than using the updated coefficient matrix. This is caused by the estimation errors of virtual markers when occlusion happens, which is inevitable since the virtual markers on the back will be self-occluded by the front body. As a result, inaccurate marker positions would bring large errors to the final mesh estimates if we directly use the fixed matrix.

No.	Method	Fixed $\widetilde{\mathbf{A}}^{sym}$	Updated $\hat{\mathbf{A}}$	MPVE $\downarrow$	MPJPE $\downarrow$
(a)	Ours (fixed)	✓	✗	64.7	51.6
(b)	Ours	✗	✓	58.0	47.3

Table 6: Ablation study of the coefficient matrix for our approach on H3.6M dataset. “fixed” means using the fixed coefficient matrix

\widetilde{\mathbf{A}}^{sym}

to reconstruct the mesh.

4.5 Qualitative Results

Figure 7 (top) presents some meshes estimated by our approach on natural images from the 3DPW test set. The rightmost case shows a typical failure where our method has a wrong pose estimate of the left leg due to heavy occlusion. We can see that the failure is constrained to the local region and the rest of the body still gets accurate estimates. We further analyze how inaccurate virtual markers would affect the mesh estimation, i.e. when part of human body is occluded or truncated. According to the finally learned coefficient matrix $\mathbf{\hat{A}}$ of our model, we highlight the relationship weights among virtual markers and all vertices in Figure 8. We can see that our model actually learns local and sparse dependency between each vertex and the virtual markers, e.g. for each vertex, the virtual markers that contribute the most are in a near range as shown in Figure 8 (b). Therefore, in inference, if a virtual marker has inaccurate position estimation due to occlusion or truncation, the dependent vertices may have inaccurate estimates, while the rest will be barely affected. Figure 2 (right) shows more examples where occlusion or truncation occurs, and our method can still get accurate or reasonable estimates robustly. Note that when truncation occurs, our method still guesses the positions of the truncated virtual markers.

Figure 7 (bottom) shows our estimated meshes on challenging cases, which indicates the strong generalization ability of our model on diverse postures and actions in natural scenes. Please refer to the supplementary for more quality results. Note that since the datasets do not provide supervision of head orientation, face expression, hands, or feet, the estimates of these parts are just in canonical poses inevitably. Apart from that, most errors are due to inaccurate 3D virtual marker estimation which may be addressed using more powerful estimators or more diverse training datasets in the future.

5 Conclusion

In this paper, we present a novel intermediate representation Virtual Marker, which is more expressive than the prevailing skeleton representation and more accessible than physical markers. It can reconstruct 3D meshes more accurately and efficiently, especially in handling diverse body shapes. Besides, the coefficient matrix in the virtual marker representation encodes spatial relationships among mesh vertices which allows the method to implicitly explore structure priors of human body. It achieves better mesh estimation results than the state-of-the-art methods and shows advanced generalization potential in spite of its simplicity.

Acknowledgement

This work was supported by MOST-2022ZD0114900 and NSFC-62061136001.

References

[1] Anurag Arnab, Carl Doersch, and Andrew Zisserman. Exploiting temporal context for 3d human pose estimation in the wild. In CVPR, pages 3395–3404, 2019.
[2] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In ECCV, pages 561–578, 2016.
[3] Ronan Boulic, Pascal Bécheiraz, Luc Emering, and Daniel Thalmann. Integration of motion control techniques for virtual human and avatar real-time animation. In Proceedings of the ACM symposium on Virtual reality software and technology, pages 111–118, 1997.
[4] Yuansi Chen, Julien Mairal, and Zaid Harchaoui. Fast and robust archetypal analysis for representation learning. In CVPR, pages 1478–1485, 2014.
[5] Junhyeong Cho, Kim Youwang, and Tae-Hyun Oh. Cross-attention of disentangled modalities for 3d human mesh recovery with transformers. In ECCV, 2022.
[6] Hongsuk Choi, Gyeongsik Moon, Ju Yong Chang, and Kyoung Mu Lee. Beyond static features for temporally consistent 3d human pose and shape from a video. In CVPR, pages 1964–1973, 2021.
[7] Hongsuk Choi, Gyeongsik Moon, and Kyoung Mu Lee. Pose2mesh: Graph convolutional network for 3d human pose and mesh recovery from a 2d human pose. In ECCV, pages 769–787, 2020.
[8] Hongsuk Choi, Gyeongsik Moon, JoonKyu Park, and Kyoung Mu Lee. Learning to estimate robust 3d human mesh from in-the-wild crowded scenes. In CVPR, pages 1475–1484, June 2022.
[9] Vasileios Choutas, Georgios Pavlakos, Timo Bolkart, Dimitrios Tzionas, and Michael J Black. Monocular expressive body regression through body-driven attention. In ECCV, pages 20–40, 2020.
[10] Hai Ci, Mingdong Wu, Wentao Zhu, Xiaoxuan Ma, Hao Dong, Fangwei Zhong, and Yizhou Wang. Gfpose: Learning 3d human pose prior with gradient fields. arXiv preprint arXiv:2212.08641, 2022.
[11] Enric Corona, Gerard Pons-Moll, Guillem Alenyà, and Francesc Moreno-Noguer. Learned vertex descent: a new direction for 3d human model fitting. In ECCV, pages 146–165. Springer, 2022.
[12] Adele Cutler and Leo Breiman. Archetypal analysis. Technometrics, 36(4):338–347, 1994.
[13] John C Gower. Generalized procrustes analysis. Psychometrika, 40(1):33–51, 1975.
[14] Shanyan Guan, **gwei Xu, Michelle Z He, Yunbo Wang, Bingbing Ni, and Xiaokang Yang. Out-of-domain human mesh reconstruction via dynamic bilevel online adaptation. IEEE TPAMI, 2022.
[15] Yinghao Huang, Federica Bogo, Christoph Lassner, Angjoo Kanazawa, Peter V Gehler, Javier Romero, Ijaz Akhter, and Michael J Black. Towards accurate marker-less human shape and pose estimation over time. In 3DV, pages 421–430, 2017.
[16] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3. 6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE TPAMI, 36(7):1325–1339, 2013.
[17] Karim Iskakov, Egor Burkov, Victor Lempitsky, and Yury Malkov. Learnable triangulation of human pose. In ICCV, pages 7718–7727, 2019.
[18] Ian T Jolliffe. Principal components in regression analysis. In Principal component analysis, pages 129–155. Springer, 1986.
[19] Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In CVPR, pages 7122–7131, 2018.
[20] Angjoo Kanazawa, Jason Y Zhang, Panna Felsen, and Jitendra Malik. Learning 3d human dynamics from video. In CVPR, pages 5614–5623, 2019.
[21] Rawal Khirodkar, Shashank Tripathi, and Kris Kitani. Occluded human mesh recovery. In CVPR, pages 1715–1725, June 2022.
[22] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[23] Muhammed Kocabas, Nikos Athanasiou, and Michael J Black. Vibe: Video inference for human body pose and shape estimation. In CVPR, pages 5253–5263, 2020.
[24] Muhammed Kocabas, Chun-Hao P. Huang, Otmar Hilliges, and Michael J. Black. Pare: Part attention regressor for 3d human body estimation. In ICCV, pages 11127–11137, October 2021.
[25] Nikos Kolotouros, Georgios Pavlakos, Michael J Black, and Kostas Daniilidis. Learning to reconstruct 3d human pose and shape via model-fitting in the loop. In ICCV, pages 2252–2261, 2019.
[26] Nikos Kolotouros, Georgios Pavlakos, and Kostas Daniilidis. Convolutional mesh regression for single-image human shape reconstruction. In CVPR, pages 4501–4510, 2019.
[27] Nikos Kolotouros, Georgios Pavlakos, Dinesh Jayaraman, and Kostas Daniilidis. Probabilistic modeling for human mesh recovery. In ICCV, pages 11605–11614, October 2021.
[28] Christoph Lassner, Javier Romero, Martin Kiefel, Federica Bogo, Michael J Black, and Peter V Gehler. Unite the people: Closing the loop between 3d and 2d human representations. In CVPR, pages 6050–6059, 2017.
[29] Jiefeng Li, Chao Xu, Zhicun Chen, Siyuan Bian, Lixin Yang, and Cewu Lu. Hybrik: A hybrid analytical-neural inverse kinematics solution for 3d human pose and shape estimation. In CVPR, pages 3383–3393, 2021.
[30] Yong-Lu Li, Liang Xu, Xinpeng Liu, Xijie Huang, Yue Xu, Shiyi Wang, Hao-Shu Fang, Ze Ma, Mingyang Chen, and Cewu Lu. Pastanet: Toward human activity knowledge engine. In CVPR, pages 382–391, 2020.
[31] Zhihao Li, Jianzhuang Liu, Zhensong Zhang, Songcen Xu, and Youliang Yan. Cliff: Carrying location information in full frames into human pose and shape estimation. In ECCV, 2022.
[32] Kevin Lin, Lijuan Wang, and Zicheng Liu. End-to-end human pose and mesh reconstruction with transformers. In CVPR, pages 1954–1963, 2021.
[33] Kevin Lin, Lijuan Wang, and Zicheng Liu. Mesh graphormer. In ICCV, pages 12939–12948, 2021.
[34] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755, 2014.
[35] Matthew Loper, Naureen Mahmood, and Michael J Black. Mosh: Motion and shape capture from sparse markers. TOG, 33(6):1–13, 2014.
[36] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multi-person linear model. TOG, 34(6):1–16, 2015.
[37] Tianyu Luan, Yali Wang, Junhao Zhang, Zhe Wang, Zhipeng Zhou, and Yu Qiao. Pc-hmr: Pose calibration for 3d human mesh recovery from 2d images/videos. In AAAI, pages 2269–2276, 2021.
[38] Dushyant Mehta, Helge Rhodin, Dan Casas, Pascal Fua, Oleksandr Sotnychenko, Weipeng Xu, and Christian Theobalt. Monocular 3d human pose estimation in the wild using improved cnn supervision. In 3DV, pages 506–516, 2017.
[39] Gyeongsik Moon and Kyoung Mu Lee. I2l-meshnet: Image-to-lixel prediction network for accurate 3d human pose and mesh estimation from a single rgb image. In ECCV, pages 752–768, 2020.
[40] Mohamed Omran, Christoph Lassner, Gerard Pons-Moll, Peter Gehler, and Bernt Schiele. Neural body fitting: Unifying deep learning and model based human pose and shape estimation. In 3DV, pages 484–494. IEEE, 2018.
[41] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed AA Osman, Dimitrios Tzionas, and Michael J Black. Expressive body capture: 3d hands, face, and body from a single image. In CVPR, pages 10975–10985, 2019.
[42] Georgios Pavlakos, Luyang Zhu, Xiaowei Zhou, and Kostas Daniilidis. Learning to estimate 3d human pose and shape from a single color image. In CVPR, pages 459–468, 2018.
[43] Liliana Lo Presti and Marco La Cascia. 3d skeleton-based human action classification: A survey. Pattern Recognition, 53:130–147, 2016.
[44] Haibo Qiu, Chunyu Wang, **gdong Wang, Naiyan Wang, and Wenjun Zeng. Cross view fusion for 3d human pose estimation. In ICCV, pages 4342–4351, 2019.
[45] Jiajun Su, Chunyu Wang, Xiaoxuan Ma, Wenjun Zeng, and Yizhou Wang. Virtualpose: Learning generalizable 3d human pose models from virtual data. In ECCV, pages 55–71. Springer, 2022.
[46] Ke Sun, Bin Xiao, Dong Liu, and **gdong Wang. Deep high-resolution representation learning for human pose estimation. In CVPR, pages 5693–5703, 2019.
[47] Xiao Sun, Bin Xiao, Fangyin Wei, Shuang Liang, and Yichen Wei. Integral human pose regression. In ECCV, pages 529–545, 2018.
[48] Yu Sun, Qian Bao, Wu Liu, Yili Fu, Michael J Black, and Tao Mei. Monocular, one-stage, regression of multiple 3d people. In ICCV, pages 11179–11188, 2021.
[49] Yu Sun, Yun Ye, Wu Liu, Wenpeng Gao, Yili Fu, and Tao Mei. Human mesh recovery from monocular images via a skeleton-disentangled representation. In ICCV, pages 5349–5358, 2019.
[50] Hanyue Tu, Chunyu Wang, and Wenjun Zeng. Voxelpose: Towards multi-camera 3d human pose estimation in wild environment. In ECCV, pages 197–212. Springer, 2020.
[51] Hsiao-Yu Tung, Hsiao-Wei Tung, Ersin Yumer, and Katerina Fragkiadaki. Self-supervised learning of motion capture. In NIPS, volume 30, 2017.
[52] Gul Varol, Duygu Ceylan, Bryan Russell, Jimei Yang, Ersin Yumer, Ivan Laptev, and Cordelia Schmid. Bodynet: Volumetric inference of 3d human body shapes. In ECCV, pages 20–36, 2018.
[53] Gul Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J Black, Ivan Laptev, and Cordelia Schmid. Learning from synthetic humans. In CVPR, pages 109–117, 2017.
[54] Timo von Marcard, Roberto Henschel, Michael J Black, Bodo Rosenhahn, and Gerard Pons-Moll. Recovering accurate 3d human pose in the wild using imus and a moving camera. In ECCV, pages 601–617, 2018.
[55] Ziniu Wan, Zhengjia Li, Maoqing Tian, Jianbo Liu, Shuai Yi, and Hongsheng Li. Encoder-decoder with multi-level attention for 3d human shape and pose estimation. In ICCV, pages 13033–13042, 2021.
[56] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. In ECCV, pages 52–67, 2018.
[57] Yuanlu Xu, Song-Chun Zhu, and Tony Tung. Denserac: Joint 3d pose and shape estimation by dense render-and-compare. In ICCV, pages 7760–7770, 2019.
[58] Chun-Han Yao, Jimei Yang, Duygu Ceylan, Yi Zhou, Yang Zhou, and Ming-Hsuan Yang. Learning visibility for robust dense human body estimation. In ECCV, 2022.
[59] Hang Ye, Wentao Zhu, Chunyu Wang, Rujie Wu, and Yizhou Wang. Faster voxelpose: Real-time 3d human pose estimation by orthographic projection. In ECCV, pages 142–159. Springer, 2022.
[60] Andrei Zanfir, Elisabeta Marinoiu, and Cristian Sminchisescu. Monocular 3d pose and shape estimation of multiple people in natural scenes-the importance of multiple scene constraints. In CVPR, pages 2148–2157, 2018.
[61] Mihai Zanfir, Andrei Zanfir, Eduard Gabriel Bazavan, William T Freeman, Rahul Sukthankar, and Cristian Sminchisescu. Thundr: Transformer-based 3d human reconstruction with markers. In ICCV, pages 12971–12980, 2021.
[62] Wang Zeng, Wanli Ouyang, ** Luo, Wentao Liu, and Xiaogang Wang. 3d human mesh regression with dense correspondence. In CVPR, pages 7054–7063, 2020.
[63] Hongwen Zhang, Jie Cao, Guo Lu, Wanli Ouyang, and Zhenan Sun. Learning 3d human shape and pose from dense body parts. IEEE TPAMI, 44(5):2610–2627, 2022.
[64] Hongwen Zhang, Yating Tian, Xinchi Zhou, Wanli Ouyang, Yebin Liu, Limin Wang, and Zhenan Sun. Pymaf: 3d human pose and shape regression with pyramidal mesh alignment feedback loop. In ICCV, pages 11446–11456, 2021.
[65] Yifu Zhang, Chunyu Wang, Xinggang Wang, Wenyu Liu, and Wenjun Zeng. Voxeltrack: Multi-person 3d human pose estimation and tracking in the wild. IEEE TPAMI, 45(2):2613–2626, 2022.

Appendix

We elaborate on the post-processing implementation of the virtual markers and provide additional experimental details and results. At last, we discuss data from human subjects and the potential societal impact.

A. Post-processing on Virtual Markers

As described in Section 3.1, considering the left-right symmetric human body structure, we slightly adjust the learned virtual markers $\mathbf{Z}$ to be symmetric. In fact, after the first step that updates each $\mathbf{z}_{i}$ by its nearest vertex to get $\widetilde{\mathbf{Z}}\in\mathbb{R}^{3\times K}$ . $\widetilde{\mathbf{Z}}$ are almost symmetric with few exceptions. To get the final symmetric virtual markers $\widetilde{\mathbf{Z}}^{sym}\in\mathbb{R}^{3\times K}$ , for each virtual marker located in the left body part, we take its symmetric vertex in the right body to be its symmetric counterpart.

Since the human mesh (i.e. SMPL [36]) itself is not strictly symmetric, we clarify the symmetric vertex pair (e.g. left elbow and right elbow) on a human mesh template $\mathbf{X}^{t}\in\mathbb{R}^{3\times M}$ in Figure 9. We place $\mathbf{X}^{t}$ at the origin of the 3D coordinate system. Formally, we define the cost of matching $i^{th}$ vertex to $j^{th}$ vertex to be $\bm{C}_{i,j}=\left|x_{i}+x_{j}\right|+\left|y_{i}-y_{j}\right|+\left|z_{i}-z_{% j}\right|$ . A symmetric vertex pair $(\mathbf{X}^{t}_{i},\mathbf{X}^{t}_{j})$ is defined to have the minimal cost $\bm{C}_{i,j}$ . In this way, for each virtual marker in the left body, we take its symmetric vertex counterpart to be its symmetric virtual marker and finally get $\widetilde{\mathbf{Z}}^{sym}$ .

B. Experiments

In this section, we first add detailed descriptions for datasets and then provide more experimental results of our approach.

B.1 Datasets

H3.6M [16].

Following previous works [19, 25, 26, 42], we use the SMPL parameters generated from MoSh [35], which are fitted to the 3D physical marker locations, to get the GT 3D mesh supervision. Following standard practice [19], we evaluate the quality of 3D pose of $14$ joints derived from the estimated mesh, i.e. $\hat{\mathbf{M}}\mathcal{J}$ . We report Mean Per Joint Position Error (MPJPE) and PA-MPJPE in millimeters (mm). The latter uses Procrustes algorithm [13] to align the estimates to GT poses before computing MPJPE. To evaluate mesh estimation results, we also report Mean Per Vertex Error (MPVE) which can be interpreted as MPJPE computed over the whole mesh.

3DPW [54].

The 3D GT SMPL parameters are obtained by using the data from IMUs when collected. Following the previous works [32, 33, 24, 61], we use the train set of 3DPW to learn the model and evaluate on the test set.

MPI-INF-3DHP [38]

is a 3D pose dataset with 3D GT pose annotations. Since this dataset does not provide 3D mesh annotations, following [19, 25], we only enforce supervision on the 3D skeletons (Eq. (9)) in mesh losses.

UP-3D [28]

is a wild 2D pose dataset with natural images. The 3D poses and meshes are obtained by SMPLify [2]. Due to the lack of GT 3D poses, the fitted meshes are not accurate. Therefore we only use the 2D annotations to train the 3D virtual marker estimation network as in [47].

COCO [34]

is a large wild 2D pose dataset with natural images. Previous work [39] used SMPLify-X [41] to obtain pseudo SMPL mesh annotations but they are not accurate. However, we find that if we project the 3D mesh to 2D image, the resulting 2D mesh vertices align well with the image. So we leverage the 2D annotations to train the virtual marker estimation network as in [47].

SURREAL [53]

is a large-scale synthetic dataset containing 6 million frames of synthetic humans. The images are photo-realistic renderings of people under large variations in shape, texture, viewpoint, and body pose. To ensure realism, the synthetic bodies are created using the SMPL body model, whose parameters are fit by the MoSh [35] given raw 3D physical marker data. All the images have a resolution of $320\times 240$ . We use the same training split to train the model and evaluate the test split following [7].

B.2 Implementation Details and Computation Resource

Following common practice [19, 7, 39, 61, 26, 23, 33, 32], we conduct mix-training by using the above 2D and 3D datasets for experiments on the H3.6M and 3DPW datasets. To leverage the 3D pose estimation dataset, i.e. MPI-INF-3DHP [38], we extend the $64$ virtual markers with the $17$ landmark joints (i.e. skeleton) from the MPI-INF-3DHP dataset. For experiments on the SURREAL dataset, we use its training set alone as in [7, 37]. We implement the proposed method with PyTorch. All the experiments are conducted on a Linux machine with 4 NVIDIA 16GB V100 GPUs. The whole network is trained for 40 epochs with batch size 32 using Adam [22] optimizer.

We evaluate the model complexity in terms of FLOPs (G) and the number of model parameters in Table 7. Compared to the most recent state-of-the-art methods that directly regress all mesh vertices, such as I2L-MeshNet [39], METRO [32], and Mesh Graphormer [33], our approach with virtual marker representation reduces the computation overhead by a large margin while getting better estimation quality. The last column shows the MPVE errors on 3DPW test set for performance reference.

Methods	FLOPs (G) $\downarrow$	Params (M)	MPVE $\downarrow$
I2L-MeshNet [39] ECCV’20	28.7	141.2	110.1
METRO [32] CVPR’21	153.0	397.5	88.2
Mesh Graphormer [33] ICCV’21	48.8	180.6	87.7
Ours	22.1	109.6	77.9

Table 7: Computation overhead comparison with the recent state-of-the-art methods that directly regress all 3D vertices. The rightmost column shows the MPVE errors on the 3DPW test set for performance reference.

	Ours	w/o $\mathcal{L}_{conf}$	w/o $\mathcal{L}_{pose}$	w/o $\mathcal{L}_{normal}$	w/o $\mathcal{L}_{edge}$
MPVE $\downarrow$	58.0	59.2	58.3	60.6	60.4

Table 8: MPVE errors on H3.6M [16] test set when ablating different loss terms.

Occ. VM Parts	MPVE $\downarrow$	MPJPE $\downarrow$	PA-MPJPE $\downarrow$
None (Ours)	77.9	67.5	41.3
2 Arms	79.2 $\uparrow$ 1.3	68.2 $\uparrow$ 0.7	42.2 $\uparrow$ 0.9
2 Legs	78.3 $\uparrow$ 0.4	67.9 $\uparrow$ 0.4	41.7 $\uparrow$ 0.4
Body	78.6 $\uparrow$ 0.7	68.0 $\uparrow$ 0.5	41.8 $\uparrow$ 0.5
Random	78.7 $\uparrow$ 0.8	68.1 $\uparrow$ 0.6	41.9 $\uparrow$ 0.6

Table 9: Results on 3DPW [54] test set when different parts of virtual markers (VM) are occluded.

B.3 Additional Quantitative Results

Different loss terms.

Table 8 reports the MPVE error on H3.6M [16] test set when ablating different loss terms. The confidence loss [17] is used to encourage the interpretability of the heatmaps to have a maxima response at the GT position. Without the confidence loss, the error increases slightly. If ablating the surface losses, MPVE increases a lot, which demonstrates the smoothing effect of these two terms.

Robustness to occlusion.

We report results when different virtual markers are occluded by a synthetic mask in Table 9. The errors are slightly larger than the original image (None), which validates the effectiveness of the locality of the virtual marker representation. Occluding arm regions results in a larger error increase. This may be because the arm has larger variations in the dataset.

Comparison to fitting.

In order to disentangle the ability of mesh regression from markers using $\hat{\mathbf{A}}$ with the ability to detect the virtual markers accurately from images, we first compute the estimation errors of the virtual markers. The MPJPE over all the virtual markers is $35.5$ mm, which demonstrates that these virtual markers can be accurately detected from the images. We then fit the mesh model parameters to these virtual markers. Table 10 shows the metrics of the fitted mesh on the SURREAL [53] test set. As we can see, the fitted mesh has a similar error as our regression ones which uses the interpolation matrix $\hat{\mathbf{A}}$ , which validates the accuracy of the estimated virtual markers.

B.4 Additional Qualitative Results

Figure 12 shows more qualitative comparisons with Pose2Mesh [7] on the SURREAL test set in which has diverse body shapes. The skeleton representation used in Pose2Mesh loses the body shape information so the method [7] can only recover mean shapes. For example, in Figure 12 (d) (e), the estimated meshes of Pose2Mesh tend to have the average body shape and fail to estimate the real body shape, regardless of whether the person is slim or stout. This is caused by the limited skeleton representation bottleneck so that the model learns a mean shape for the whole training dataset implicitly. In contrast, our approach with virtual marker representation generates more accurate mesh estimation results.

Method	MPVE $\downarrow$	MPJPE $\downarrow$	PA-MPJPE $\downarrow$
Fitting	44.6	34.8	29.5
Ours	44.7	36.9	28.9

Table 10: Results on SURREAL [53] test set when the mesh is obtained by fitting to the estimated virtual markers.

Figure 13 shows more qualitative comparisons with Pose2Mesh [7] and METRO [32] on the 3DPW test set. Pose2Mesh and METRO use the skeleton or all 3D vertices as intermediate representations, respectively. The estimated meshes are overlaid on the images according to the camera parameters. Pose2Mesh [7] has difficulty in estimating correct body pose and shapes when truncation occurs (a) or in complex postures (c). The results of METRO [32] have many artifacts where the estimated mesh is not smooth, and they also fail to align the image well. In contrast, our method estimates more accurate human poses and shapes and has smooth human mesh results. In addition, it is more robust to truncation and occlusion and aligns the image better.

Figure 14 shows more quality results of our approach on the 3DPW [54], H3.6M [16], MPI-INF-3DHP [38], and COCO [34] datasets. Figure 10 shows more qualitative results on Internet images with challenging cases, such as extreme body shapes or complex poses. Our method generalizes well on the natural scenes. Figure 11 shows typical failure cases, including inaccurate shape estimation and interpenetration, which are mainly caused by inaccurate 3D virtual marker estimation when occlusion occurs. But as expected, the rest body parts are barely affected due to the local and sparse property of the virtual marker.

C. Human Subject Data

We use existing public datasets of human subjects in our experiments following their official licensing requirements. With proper usage, the proposed method could be beneficial to society (e.g. elderly fall detection).