DICE: End-to-end Deformation Capture of Hand-Face Interactions from a Single Image

Qingxuan Wu¹, Zhiyang Dou^1,2,∗, Sirui Xu³, Soshi Shimada⁴, Chen Wang¹, Zhengming Yu⁶,
Yuan Liu², Cheng Lin², Zeyu Cao⁵, Taku Komura², Vladislav Golyanik⁴,
Christian Theobalt⁴, Wen** Wang⁶, Lingjie Liu^1,
¹University of Pennsylvania, ²The University of Hong Kong,
³University of Illinois Urbana-Champaign, ⁴Max Planck Institute for Informatics,
⁵University of Cambridge, ⁶Texas A&M University. ∗Corresponding authors.

Abstract

Reconstructing 3D hand-face interactions with deformations from a single image is a challenging yet crucial task with broad applications in AR, VR, and gaming. The challenges stem from self-occlusions during single-view hand-face interactions, diverse spatial relationships between hands and face, complex deformations, and the ambiguity of the single-view setting. The first and only method for hand-face interaction recovery, Decaf [97], introduces a global fitting optimization guided by contact and deformation estimation networks trained on studio-collected data with 3D annotations. However, Decaf suffers from a time-consuming optimization process and limited generalization capability due to its reliance on 3D annotations of hand-face interaction data. To address these issues, we present DICE, the first end-to-end method for Deformation-aware hand-face Interaction reCovEry from a single image. DICE estimates the poses of hands and faces, contacts, and deformations simultaneously using a Transformer-based architecture. It disentangles the regression of local deformation fields and global mesh vertex locations into two network branches, enhancing deformation and contact estimation for precise and robust hand-face mesh recovery. To improve generalizability, we propose a weakly-supervised training approach that augments the training set using in-the-wild images without 3D ground-truth annotations, employing the depths of 2D keypoints estimated by off-the-shelf models and adversarial priors of poses for supervision. Our experiments demonstrate that DICE achieves state-of-the-art performance on a standard benchmark and in-the-wild data in terms of accuracy and physical plausibility. Additionally, our method operates at an interactive rate (20 fps) on an Nvidia 4090 GPU, whereas Decaf requires more than 15 seconds for a single image. Our code will be publicly available upon publication.

1 Introduction

Hand-face interaction is a common behavior observed up to 800 times per day across all ages and genders [98]. Therefore, faithfully recovering hand-face interactions with plausible deformations is an important task given its wide applications in AR/VR [87, 36, 113], character animation [88, 136], and human behavior analysis [66, 29, 74]. Given the speed requirement of downstream applications like AR/VR, fast and accurate 3D reconstruction of hand-face interactions is highly desirable. However, several challenges make monocular hand-face deformation and interaction recovery a difficult task: 1) the self-occlusion involved in hand-face interaction, 2) the diversity of hand and face poses, contacts, and deformations, and 3) ambiguity in the single-view setting. Most existing methods [91, 75] reconstruct only the hand [92] and face [57] meshes, either separately or unified with the other body parts as a whole body [67, 82], without capturing contacts and deformations. A seminal advance, Decaf [97], recovers hand-face interactions with deformations and contacts taken into account. However, it requires time-consuming optimization, which takes more than 15 seconds per image, rendering it unsuitable for interactive applications. The iterative fitting process of Decaf relies on an accurate estimation of hand and face keypoints and contacts on the hand and face surfaces, which can fail when significant occlusion is present in the image (see Fig. 8 in Appendix D.1). Additionally, Decaf cannot scale its training to the abundant hand-face interaction data in the wild, as it requires 3D ground-truth annotations, i.e., contact labels and deformations.


Figure 1: Our method, DICE, is the first end-to-end approach that captures hand-face interaction and deformation from a monocular image. (a) Decaf validation dataset. (b) In-the-wild images. (c) Use-cases in VR.

To tackle the issues above, we present DICE, the first end-to-end approach for Deformation-aware hand-face Interaction reCovEry from a monocular image. We use a Transformer-based model with the attention mechanism to effectively capture hand-face relationships. Motivated by the global nature of the pose and shape of hand and face and the local nature of the deformation field and contact probabilities, we further propose disentangling the regression of deformation from the pose and shape of hand and face represented by mesh vertex positions into two network branches, which enhances the estimation of deformations and contacts while resulting in accurate and robust hand and face mesh recovery. Instead of directly regressing hand and face parameters, we learn an intermediate non-parametric mesh representation. We then use this representation to regress the pose and shape parameters of hand and face with a neural inverse-kinematics network. Directly regressing the pose and shape parameters requires learning abstract parameters through a highly non-linear mapping and suffers from image-model misalignment. In comparison, predicting vertex positions in Euclidean space and then applying inverse kinematics enhances the reconstruction accuracy [53, 52, 51]. Consequently, our model achieves higher reconstruction accuracy than all previous regression and optimization-based methods. It also reaps the benefits of an animatable parametric hand and face representation that could be readily used by downstream applications.

Meanwhile, despite containing rich annotations, the existing benchmark dataset [97] collected in a studio is still limited in the diversity of hand motions, facial expressions, and appearances. Training a model only on such a dataset limits its generalization capability when applied to in-the-wild data. To achieve robust and generalizable hand-face interaction and deformation recovery, we introduce a weak-supervision training pipeline that utilizes in-the-wild images without the reliance on 3D annotations.

In addition to the 2D keypoint supervision for in-the-wild images, we propose a novel depth supervision pipeline. This pipeline leverages the robust depth prior from a diffusion-based monocular depth estimation model [44], which provides essential geometric information for accurate mesh recovery and captures spatial relationships critical for contact state and deformation estimation. To improve our model’s robustness, we further employ pose priors of the hand and face by introducing hand and face parameter discriminators that learn rich hand and face motion priors from multiple datasets on hand or face separately [79, 140]. By incorporating a small set of real-world images alongside the Decaf dataset and leveraging our weak-supervision pipeline, we markedly enhance the accuracy and generalization capacity of our model.

As a result, our method achieves superior performance in terms of accuracy, physical plausibility, inference speed, and generalizability. It surpasses all previous methods in accuracy on both standard benchmarks and challenging in-the-wild images. Fig. 1 visualizes some results of our method. We conduct extensive experiments to validate our method. In summary, our contribution is three-fold:

• We propose DICE, the first end-to-end learning-based approach that accurately recovers hand-face interactions and deformations from a single image.

• We propose a novel weak-supervised training scheme with depth supervision on keypoints to augment the Decaf data distribution with a diverse real-world data distribution, significantly improving the generalization ability.

• DICE achieves superior reconstruction quality compared to baseline methods while running at an interactive rate (20 fps).

2 Related Work

Extensive efforts have been made to recover meshes from monocular images, including human bodies [2, 71, 53, 5, 14, 116, 112, 111, 64, 43, 4, 128, 25, 59, 110, 18, 11, 40, 63], hands [93, 73, 72, 70, 76, 81, 121, 120, 54, 126], and faces [24, 26, 115, 16, 139, 7, 134, 78, 35, 8, 48, 50]. This also includes recovering the surrounding environments [12, 39, 32, 33, 135, 58, 133, 96, 69, 114] and interacting objects [119, 129, 85, 104, 30, 100, 129, 27, 86, 34, 123, 9, 10, 65, 15] while reconstructing the mesh. The acquired versatile behaviors play a crucial role in various applications, including motion generation [101, 83, 80, 28, 108, 117, 118, 61, 138, 105, 84, 17, 106], augmented reality (AR), virtual reality (VR), and human behavior analysis [130, 122, 132, 131, 29, 66]. In the following, we mainly review the related works on hand, face and full-body mesh recovery.

3D Interacting Hands Recovery. Recent advancements have markedly enhanced the capture and recovery of 3D hand interactions. Early studies have achieved reconstruction of 3D hand-hand interactions utilizing a fitting framework, employing resources such as RGBD sequences [77], hand segmentation maps [74], and dense matching maps [107]. The introduction of large-scale datasets for interacting hands [73, 72] has motivated the development of regression-based approaches. Notably, these include methods that regress 3D interacting hands directly from monocular RGB images [93, 70, 127, 55, 141]. Additionally, research has extended to recovering interactions between hands and various objects in the environment, including rigid [6, 27, 65, 100, 20, 125, 124, 13], articulated [21], and deformable [103] objects. Following [97], our work distinguishes itself by introducing hand interactions with a deformable face, characterized by its non-uniform stiffness, which is a significant difference from conventional deformable models. This setting presents unique challenges in accurately modeling interactions.

3D Human Face Recovery. Research in human face recovery encompasses both optimization-based [1, 102] and regression-based [26, 94] methodologies. Beyond mere geometry reconstruction, recent approaches have evolved to incorporate training networks with the integration of differentiable renderers [24, 139, 137, 109, 11]. These methods estimate variables such as lighting, albedo, and normals to generate facial images and compare them with the monocular input. However, a significant limitation in much of the existing literature is the neglect of the face’s deformable nature and hand-face interactions. Decaf [97] represents a pivotal development in this area, attempting to model the complex mimicry of musculature and the underlying skull anatomy through optimization techniques. In contrast, our work introduces a regression-based, end-to-end method for efficient problem-solving, setting a new benchmark in the field.

3D Full-Body Recovery. The task of monocular human pose and shape estimation involves reconstructing a 3D human body from a single color image. Optimization-based approaches [2, 82, 95, 90] employ the SMPL model [67], fitting it to 2D keypoints detected within the image. Conversely, regression-based methods [53, 49, 45, 43, 23, 22, 62, 5, 25] leverage deep neural networks to directly infer the pose and shape parameters of the SMPL model. Hybrid methods [46] integrate both optimization and regression techniques, enhancing 3D model supervision. Like these approaches, we follow parametric methods [53, 5, 43, 2] due to their flexibility for animation purposes. Unlike most research in this domain, which primarily concentrates on the main body with only rough estimations of hands and face, our methodology accounts for detailed interactions between these components.

3 Method


Figure 2: Overview of the proposed DICE framework. The input image is first fed to a CNN to extract a feature map, which is then passed to the Transformer-based encoders for mesh and interaction, i.e., MeshNet and InteractionNet. The MeshNet extracts hand and face mesh features, which are then used by the Inverse Kinematics models (IKNets) to predict pose and shape parameters that drive the FLAME [57] and MANO [92] models. The InteractionNet predicts per-vertex hand-face contact probabilities and face deformation fields from the feature map; the latter are applied to the face mesh output by the FLAME model. To improve the generalization capability, we introduce a Weakly-Supervised Training Scheme using off-the-shelf 2D keypoint detection models [68, 3] and depth estimation models [44] to provide depth supervision on keypoints. In addition, we use head and hand discriminators to constrain the distribution of parameters regressed by IKNets.

Problem Formulation. Following Decaf [97], we adopt the FLAME [57] and MANO [92] parametric models for face and hand, respectively. Given a single RGB image $\mathbf{I}\in\mathbb{R}^{224\times 224\times 3}$ , the objective of this task is to reconstruct the vertices of the hand mesh $\mathbf{V}_{H}\in\mathbb{R}^{778\times 3}$ and face mesh $\mathbf{V}_{F}\in\mathbb{R}^{5023\times 3}$ , along with the face deformation vectors $\mathbf{D}\in\mathbb{R}^{5023\times 3}$ caused by hand-face interaction and the non-rigid nature of the face, and the per-vertex contact probabilities of hand $\mathbf{C}_{H}\in\mathbb{R}^{778}$ and face ${\mathbf{C}_{F}}\in\mathbb{R}^{5023}$ .
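To make the output shapes concrete, the following is a minimal sketch of a container for the quantities listed above. The class name `HandFaceOutput`, its field names, and the `validate` helper are hypothetical and only mirror the dimensions stated in the problem formulation.

```python
from dataclasses import dataclass

import numpy as np

# Vertex counts of the MANO hand and FLAME face meshes, as stated in the text.
N_HAND, N_FACE = 778, 5023


@dataclass
class HandFaceOutput:
    """Hypothetical container for the quantities in the problem formulation."""

    hand_vertices: np.ndarray     # V_H, shape (778, 3)
    face_vertices: np.ndarray     # V_F, shape (5023, 3)
    face_deformation: np.ndarray  # D,   shape (5023, 3)
    hand_contact: np.ndarray      # C_H, shape (778,), probabilities in [0, 1]
    face_contact: np.ndarray      # C_F, shape (5023,), probabilities in [0, 1]

    def validate(self) -> None:
        """Raise ValueError if a field has the wrong shape or value range."""
        expected = {
            "hand_vertices": (N_HAND, 3),
            "face_vertices": (N_FACE, 3),
            "face_deformation": (N_FACE, 3),
            "hand_contact": (N_HAND,),
            "face_contact": (N_FACE,),
        }
        for name, shape in expected.items():
            if getattr(self, name).shape != shape:
                raise ValueError(f"{name} must have shape {shape}")
        for name in ("hand_contact", "face_contact"):
            values = getattr(self, name)
            if values.min() < 0.0 or values.max() > 1.0:
                raise ValueError(f"{name} must contain probabilities")
```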

3.1 Transformer-based Hand-Face Interaction Recovery

Our model incorporates a two-branch Transformer architecture and integrates inverse-kinematic models, specifically MeshNet, InteractionNet, and IKNets. A differentiable renderer [89] is used to compute depth maps from the predicted mesh for depth supervision, and the hand and face discriminators are used as priors for constraining the hand and face poses; See Fig. 2 for an overview.

Given a monocular RGB image $\mathbf{I}$ , we use a pretrained HRNet-W64 [99] backbone to extract a feature map $\bm{X}_{\text{I}}\in\mathbb{R}^{H\times W\times C}$ , where $H,W$ are the spatial dimensions and $C$ is the channel dimension. Following [63, 64], we flatten the image feature maps and upsample the $H\times W$ feature maps to $N$ feature maps $\mathbf{F}^{\prime}\in\mathbb{R}^{N\times C}$ , one for each keypoint and coarse vertex of both hand and face. We then concatenate $\mathbf{F}^{\prime}$ with the $N\times 3$ coordinates of the downsampled hand and face vertices and keypoints in the mean pose, which act as the mesh vertex and joint queries for the Transformer, resulting in a feature map $\mathbf{F}\in\mathbb{R}^{N\times(C+3)}$ . The mesh vertices and joints also serve as the positional encoding. To effectively model the interaction between the vertices, we mask the image feature maps corresponding to a random subset of vertices.
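The query construction described above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `build_queries`, the masking ratio, and the choice to zero out masked features are all hypothetical, since the text does not specify them.

```python
import numpy as np


def build_queries(features, template_xyz, mask_ratio=0.3, rng=None):
    """Concatenate per-query image features with mean-pose coordinates.

    features:     (N, C) image features, one row per keypoint / coarse vertex.
    template_xyz: (N, 3) mean-pose coordinates of the same keypoints / vertices.
    Returns the (N, C + 3) queries and the boolean mask of masked rows.
    """
    rng = np.random.default_rng() if rng is None else rng
    masked = rng.random(features.shape[0]) < mask_ratio
    feats = features.copy()
    # Assumption: masked queries have their image features replaced by zeros.
    feats[masked] = 0.0
    # The 3D template coordinates are kept, so they still act as positional encoding.
    queries = np.concatenate([feats, template_xyz], axis=1)
    return queries, masked
```

Masked queries keep only their template coordinates, so the Transformer has to infer them from the remaining vertices.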

We propose to use two separate branches, MeshNet and InteractionNet, which split the regression of mesh vertices from that of the deformation field because of their semantic difference: the mesh positions are more global while the deformation vectors and contact states are relatively local. Specifically, the backbone is followed by two progressively downsampling transformers: MeshNet, which takes the feature map $\mathbf{F}$ as input and regresses the rough vertex positions of the hand, $\mathbf{V}_{H}^{\prime}$ , and face, $\mathbf{V}_{F}^{\prime}$ ; and InteractionNet, which first downsamples the feature map $\mathbf{F}$ , then uses it to predict the 3D deformation field $\mathbf{D}$ at each face vertex along with the contact labels for each hand and face vertex, $\mathbf{C}_{H}$ and ${\mathbf{C}_{F}}$ . Note that the contacts and deformations are regressed in the same encoder to model their close relationship: the contacts cause the deformations. We validate our design in Sec. 4.4.

Next, instead of directly outputting the regressed hand and face vertices, we regress the pose and shape of the parametric hand and face models, making the output readily animatable for downstream applications. This is achieved by a neural inverse kinematics model, named IKNet, similar to Kolotouros et al. [47]. Our IKNet takes the roughly estimated hand and face mesh vertices $\mathbf{V}_{H}^{\prime}$ and $\mathbf{V}_{F}^{\prime}$ as inputs and predicts the hand and face pose, shape and expression parameters $(\theta_{\text{h}},\beta_{\text{h}})$ , $(\theta_{\text{f-pose}},\beta_{\text{f}},\theta_{\text{f-exp}})$ , along with the position and orientation for hand [92] and face [57], respectively. We use the predicted parameters to first obtain the hand mesh and undeformed face mesh $\mathbf{V}_{H}$ , $\mathbf{V}_{F}^{*}$ . Then, we apply the deformation $\mathbf{D}$ predicted by the InteractionNet on $\mathbf{V}_{F}^{*}$ to get the final deformed face $\mathbf{V}_{F}$ . Regressing parameters offers several advantages: first, it enables readily animatable meshes; second, compared to non-parametric regression methods, where meshes typically contain artifacts such as spikes [63, 11, 64], the mesh quality is significantly improved; third, the compact parameter space facilitates a more effective discriminator, which will be discussed in the following section.
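The last step, applying $\mathbf{D}$ to $\mathbf{V}_{F}^{*}$ , can be illustrated as below. The sketch assumes the deformation is a per-vertex displacement that is added to the undeformed FLAME vertices; the text only says the deformation is "applied", so the additive form and the function name `apply_face_deformation` are assumptions.

```python
import numpy as np


def apply_face_deformation(face_undeformed, deformation):
    """Return V_F = V_F* + D, assuming D holds per-vertex 3D displacements."""
    face_undeformed = np.asarray(face_undeformed, dtype=float)
    deformation = np.asarray(deformation, dtype=float)
    if face_undeformed.shape != deformation.shape:
        raise ValueError("vertices and deformation field must share one shape")
    return face_undeformed + deformation
```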

3.2 Weakly-Supervised Training Scheme

Although the aforementioned benchmark, Decaf [97], accurately captures hand, face, self-contact, and deformations, it consists of eight subjects and is recorded in a controlled environment with green screens. Training a model only with Decaf limits its generalization capability to in-the-wild images that have far more complex and diverse human identities, hand poses, and face poses.

To further enhance the generalization capability of our model, we train our model with $500$ diverse in-the-wild images of hand-face interaction collected from the internet without the reliance on the 3D ground truth annotations. First, we use 2D hand-and-face keypoints detected by [68] and [3].

Then, we propose to use Marigold [44], a diffusion-based monocular depth estimator pre-trained on a large number of images to generate 2D affine-invariant depth maps for supervision in the direction of depth (see Eq. 4). The depth supervision provides a strong depth prior, which guides the spatial relationship between hand and face meshes, promoting accurate modeling of hand-face interaction. For supervision, we use a differentiable rasterizer [89] to compute a depth map from the predicted hand and face meshes and supervise the network using a depth loss calculated between the depth values of hand and face keypoints and corresponding points on the predicted depth map. The introduced weak-supervision pipeline significantly enhances our model’s generalization capability and robustness, which we investigate in Sec. 4.4. In our experiment, we found that when training the model with only a small dataset of 500 images, we could significantly improve the model’s accuracy and generalization capability. Moreover, we train adversarial priors on the hand and face parameter space on multiple hand and face pose datasets: the face-only RenderMe-360 [79], the hand-only FreiHand [140], and Decaf [97]. This ensures the plausibility of generated face and hand poses and shapes while allowing for flexible poses and shapes beyond the Decaf data distribution to handle in-the-wild cases.
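The keypoint-wise depth lookup used for this supervision can be sketched as follows. The nearest-pixel sampling and the function name `sample_depth_at_keypoints` are assumptions made for illustration; the text does not state how depth values are read out at keypoint locations.

```python
import numpy as np


def sample_depth_at_keypoints(depth_map, keypoints_xy):
    """Read depth values at 2D keypoints with nearest-pixel lookup.

    depth_map:    (H, W) depth image (rendered or estimated).
    keypoints_xy: (K, 2) pixel coordinates in (x, y) order.
    Keypoints outside the image are clamped to the nearest border pixel.
    """
    depth_map = np.asarray(depth_map, dtype=float)
    keypoints_xy = np.asarray(keypoints_xy, dtype=float)
    height, width = depth_map.shape
    x = np.clip(np.rint(keypoints_xy[:, 0]).astype(int), 0, width - 1)
    y = np.clip(np.rint(keypoints_xy[:, 1]).astype(int), 0, height - 1)
    return depth_map[y, x]
```

Applying the same lookup to the rendered depth map and to the estimated depth map yields the two per-keypoint depth vectors that the depth loss compares.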

3.3 Loss Functions

Mesh losses $\mathcal{L}_{\text{mesh}}$ : For richly annotated data in Decaf [97], we employ an $L_{1}$ loss for 3D keypoints, 3D vertices, and 2D reprojected keypoints against their respective ground-truths, following common practice in human- and hand-mesh recovery [63, 11, 18]. We further apply an $L_{1}$ loss $\mathcal{L}_{\text{params}}$ on the estimated hand and face pose, shape, and facial expression against the ground-truth parameters. For in-the-wild data, only the 2D reprojected keypoints are supervised, as this is the only type with corresponding ground truth.

Interaction losses $\mathcal{L}_{\text{interaction}}$ : For data in Decaf [97], we impose Chamfer Distance losses to enforce touch for predicted contact vertices and discourage collision. We also use a binary cross-entropy loss to supervise contact labels and a deformation loss with adaptive weighting to supervise deformation vectors, similar to [97]. For in-the-wild data, we also impose touch and collision losses since they do not require annotations.
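As a rough illustration of the touch term, the sketch below computes a one-sided Chamfer-style distance from predicted contact vertices on one surface to the nearest vertex on the other. The exact formulation used in the paper and in Decaf may differ; the function name `touch_loss`, the contact threshold of 0.5, and the use of vertex-to-vertex distances are assumptions.

```python
import numpy as np


def touch_loss(src_vertices, src_contact_prob, dst_vertices, threshold=0.5):
    """Mean distance from predicted contact vertices to the other mesh.

    src_vertices:     (N, 3) vertices of one surface (e.g., the hand).
    src_contact_prob: (N,) predicted contact probabilities for those vertices.
    dst_vertices:     (M, 3) vertices of the other surface (e.g., the face).
    """
    src_vertices = np.asarray(src_vertices, dtype=float)
    dst_vertices = np.asarray(dst_vertices, dtype=float)
    in_contact = np.asarray(src_contact_prob) > threshold
    if not in_contact.any():
        return 0.0  # nothing is predicted to touch, so nothing is penalized
    src = src_vertices[in_contact]
    # Pairwise distances between contact vertices and all target vertices.
    dists = np.linalg.norm(src[:, None, :] - dst_vertices[None, :, :], axis=-1)
    # One-sided Chamfer term: nearest target vertex for each contact vertex.
    return float(dists.min(axis=1).mean())
```

Because this term depends only on predicted quantities, it can be evaluated on in-the-wild images without annotations, as noted above.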

Adversarial loss $\mathcal{L}_{\text{adv}}$ : The adversarial loss is applied to the predicted hand and face parameters for in-the-wild data to constrain their parameter space, and for Decaf data to facilitate the training of the discriminators. The adversarial loss is given by:

$$\mathcal{L}_{\text{adv}}(E)=\mathbb{E}_{\theta_{F}\sim p_{E}}\left[\log\left(1-D_{F}(E(I))\right)\right]+\mathbb{E}_{\theta_{H}\sim p_{E}}\left[\log\left(1-D_{H}(E(I))\right)\right]. \tag{1}$$

The losses for the hand and face discriminators are given by:

$$\mathcal{L}_{\text{adv}}(D_{F})=-\left(\mathbb{E}_{\theta_{F}\sim p_{E}}\left[\log\left(1-D_{F}(E(I))\right)\right]+\mathbb{E}_{\theta_{F}\sim p_{\text{data}}}\left[\log\left(D_{F}(\theta_{F})\right)\right]\right) \tag{2}$$

and

$$\mathcal{L}_{\text{adv}}(D_{H})=-\left(\mathbb{E}_{\theta_{H}\sim p_{E}}\left[\log\left(1-D_{H}(E(I))\right)\right]+\mathbb{E}_{\theta_{H}\sim p_{\text{data}}}\left[\log\left(D_{H}(\theta_{H})\right)\right]\right), \tag{3}$$

where $E$ jointly denotes the image backbone, the mesh encoder and the parameter regressor, $\theta_{F}=\text{concat}(\theta_{\text{face-shape}},\theta_{\text{face-jaw}},\theta_{\text{face-exp}})$ , and $\theta_{H}=\text{concat}(\theta_{\text{hand-shape}},\theta_{\text{hand-pose}})$ .
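The adversarial objectives can be evaluated numerically from discriminator outputs, as in the following sketch. It replaces the expectations with batch means; the function names and the small constant `EPS` (a numerical guard against $\log 0$) are assumptions and not part of the equations.

```python
import numpy as np

EPS = 1e-12  # numerical guard only, not part of Eqs. 1-3


def encoder_adv_loss(d_face_fake, d_hand_fake):
    """Eq. 1: encoder loss from discriminator scores on predicted parameters."""
    d_face_fake = np.asarray(d_face_fake, dtype=float)
    d_hand_fake = np.asarray(d_hand_fake, dtype=float)
    return float(
        np.mean(np.log(1.0 - d_face_fake + EPS))
        + np.mean(np.log(1.0 - d_hand_fake + EPS))
    )


def discriminator_loss(d_fake, d_real):
    """Eqs. 2-3: loss of one discriminator (face or hand).

    d_fake: scores in (0, 1) for parameters predicted by the encoder E.
    d_real: scores in (0, 1) for parameters sampled from the prior datasets.
    """
    d_fake = np.asarray(d_fake, dtype=float)
    d_real = np.asarray(d_real, dtype=float)
    return float(
        -(np.mean(np.log(1.0 - d_fake + EPS)) + np.mean(np.log(d_real + EPS)))
    )
```

For an undecided discriminator that outputs 0.5 everywhere, `discriminator_loss` evaluates to $2\log 2$ and `encoder_adv_loss` to $-2\log 2$ .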

Depth loss $\mathcal{L}_{\text{depth}}$ : To provide pseudo-3D hand and face keypoints supervision for in-the-wild data, we use a modified SILog Loss [19], an affine-invariant depth loss as our depth supervision $\mathcal{L}_{\text{depth}}$ . Formally, let $\hat{K}_{D}$ denote the pseudo-ground-truth affine-invariant depth of the face and hand keypoints, and $K_{D}$ denote the rendered depth for the keypoints,

$$\mathcal{L}_{\text{depth}}=\left[\mathbf{Var}\left(\log(K_{D}+\varepsilon)-\log(\hat{K}_{D}+\varepsilon)\right)\right]^{1/2}, \tag{4}$$

where $\mathbf{Var}$ is the standard variance operator and $\varepsilon=10^{-7}$ .
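Eq. 4 can be checked numerically with a few lines of NumPy. The sketch is illustrative only: the function name `depth_loss` and the array-based interface are assumptions, and in the paper the rendered depths come from a differentiable rasterizer rather than from plain arrays.

```python
import numpy as np


def depth_loss(rendered_depth, pseudo_gt_depth, eps=1e-7):
    """Eq. 4: standard deviation of the per-keypoint log-depth differences.

    rendered_depth:  (K,) positive depths K_D rendered from the predicted meshes.
    pseudo_gt_depth: (K,) positive pseudo-ground-truth depths at the same keypoints.
    """
    rendered_depth = np.asarray(rendered_depth, dtype=float)
    pseudo_gt_depth = np.asarray(pseudo_gt_depth, dtype=float)
    diff = np.log(rendered_depth + eps) - np.log(pseudo_gt_depth + eps)
    # np.var computes the population variance of the log differences.
    return float(np.sqrt(np.var(diff)))
```

Because a global scale factor on either depth vector only shifts every log difference by the same constant, it leaves the variance, and hence this loss, essentially unchanged (up to the effect of $\varepsilon$ ). Only the relative depth ordering and ratios between keypoints are penalized.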

Overall, our loss for the mesh and interaction networks is formulated by

$$\mathcal{L}=\lambda_{\text{mesh}}\mathcal{L}_{\text{mesh}}+\lambda_{\text{interaction}}\mathcal{L}_{\text{interaction}}+\lambda_{\text{adv}}\mathcal{L}_{\text{adv}}+\lambda_{\text{depth}}\mathcal{L}_{\text{depth}}, \tag{5}$$

where $\lambda_{\text{mesh}}=12.5,\lambda_{\text{interaction}}=5,\lambda_{\text{depth}}=2.5,\lambda_{\text{adv}}=1$ for all the experiments in the paper; see more details in Appendix C.
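The weighted sum in Eq. 5 is straightforward to write down with the weights above. In this sketch, the dictionary-based interface and the name `total_loss` are assumptions; terms that are not available for a given sample (for example, the depth term, which is only applied to in-the-wild data) are simply omitted.

```python
# Loss weights as reported in the text for all experiments.
WEIGHTS = {"mesh": 12.5, "interaction": 5.0, "depth": 2.5, "adv": 1.0}


def total_loss(terms):
    """Eq. 5: weighted sum of the available loss terms.

    terms: dict mapping a loss name to its scalar value; missing names count as 0.
    """
    unknown = set(terms) - set(WEIGHTS)
    if unknown:
        raise KeyError(f"unknown loss terms: {sorted(unknown)}")
    return sum(WEIGHTS[name] * value for name, value in terms.items())
```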

4 Experimental Results

4.1 Datasets and Metrics

Datasets.

We employ Decaf [97] for reconstructing 3D face and hand interactions with deformations, along with the in-the-wild dataset we collected with 500 images. We use the hand and face shape, pose, and expression data from Decaf [97], RenderMe-360 [79], and FreiHand [140] for training the adversarial priors. We use the training set of the aforementioned datasets for network training. We utilize the Decaf test set for quantitative evaluation, and additionally, we visualize in-the-wild images from the test set for qualitative evaluation.

Metrics.

We adopt commonly-used metrics for mesh recovery accuracy following [43, 63, 18, 11]:
$\bullet$ Mean Per-Joint Position Error (MPJPE): the average Euclidean distance between predicted keypoints and ground-truth keypoints.
$\bullet$ PAMPJPE: MPJPE after Procrustes Analysis (PA) alignment.
$\bullet$ Per-Vertex Error (PVE): the average Euclidean distance between predicted and ground-truth mesh vertices, computed after translation alignment.
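The two joint-error metrics can be sketched in NumPy as below. The Procrustes step uses the standard SVD-based similarity alignment (rotation, uniform scale, translation); the function names are illustrative, and the evaluation code used in the paper may differ in details such as units or joint sets.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean Euclidean distance between corresponding (K, 3) keypoints."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def procrustes_align(pred, gt):
    """Similarity-align pred to gt (rotation, uniform scale, translation)."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    u, s, vt = np.linalg.svd(p.T @ g)
    # Flip the last singular direction if needed so that rot is a proper rotation.
    d = np.sign(np.linalg.det(vt.T @ u.T))
    signs = np.array([1.0, 1.0, d])
    rot = vt.T @ np.diag(signs) @ u.T
    scale = (s * signs).sum() / (p ** 2).sum()
    return scale * p @ rot.T + mu_g


def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment of the prediction to the ground truth."""
    return mpjpe(procrustes_align(pred, gt), gt)
```

PAMPJPE therefore ignores global rotation, scale, and translation errors and measures only the remaining articulation and shape error.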

Following Decaf [97], we use the plausibility metrics mentioned below:
$\bullet$ Collision Distance (Col. Dist.): the average collision distances over vertices and frames;
$\bullet$ Non-Collision Ratio (Non. Col.): the proportion of frames without hand-face collisions;
$\bullet$ Touchness Ratio: the ratio of hand-face contacts among ground truth contacting frames;
$\bullet$ F-Score: the harmonic mean of Non-Collision Ratio and Touchness Ratio.
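The plausibility F-Score is a plain harmonic mean, which the following sketch makes explicit. The function name is illustrative; inputs are percentages as reported in the tables.

```python
def plausibility_f_score(non_collision_ratio, touchness_ratio):
    """Harmonic mean of Non-Collision Ratio and Touchness Ratio (same units)."""
    total = non_collision_ratio + touchness_ratio
    if total == 0:
        return 0.0
    return 2.0 * non_collision_ratio * touchness_ratio / total
```

The harmonic mean is dominated by the smaller of the two ratios: a method that never produces collisions because the hand never touches the face gets a high Non-Collision Ratio but a low F-Score.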

4.2 Implementation Details

We train the MeshNet, InteractionNet, and IKNet along with the face and hand discriminators using three identically configured AdamW optimizers with a learning rate of $6\times 10^{-4}$ and a learning rate decay of $1\times 10^{-4}$ , optimizing in an alternating manner. The batch size is set to $16$ during training. Each training run takes $40$ epochs and $48$ hours. The model is trained and evaluated on 8 Nvidia A6000 GPUs with an AMD 128-core CPU. For a fair comparison, all baseline models use the settings in their original papers. Inference times are measured on a single Nvidia A6000 GPU.

4.3 Performance on Hand-Face Interaction and Deformation Recovery

In addition to baselines considered in Decaf [97], we compare our method with a representative work in human body/hand mesh recovery, METRO, an end-to-end transformer-based model. For a fair comparison, we compare our method with a modified version of METRO [63] for predicting hand and face meshes, with extra output heads added to predict contact and deformation.


Figure 3: Qualitative results of hand-face interaction, deformation, and contact recovery by DICE on Decaf and in-the-wild images. In contact visualizations, a deeper color indicates a higher contact probability.


Figure 4: Qualitative comparison of DICE, Decaf [97], PIXIE [23], and METRO* [63] on the Decaf validation set and in-the-wild images. Our method achieves superior reconstruction accuracy and plausibility on the Decaf [97] dataset, while generalizing well to difficult in-the-wild actions unseen in Decaf.

4.3.1 Quantitative Evaluations

Table 1: Comparison of hand-face interaction and deformation recovery on Decaf.

		3D Reconstruction Error			Physics Plausibility Metrics				
Methods	Type	PVE‡ $\downarrow$	MPJPE $\downarrow$	PAMPJPE $\downarrow$	Col. Dist. $\downarrow$	Non. Col. $\uparrow$	Touchness $\uparrow$	F-Score $\uparrow$	Running Time (per image; s) $\downarrow$
Decaf [97]	O	$9.65$	$-$	$-$	$1.03$	$83.6$	$96.6$	$\textbf{89.6}$	$19.59$
Benchmark [68, 57]	O	$17.7$	$-$	$-$	$19.3$	$64.2$	$73.2$	$68.4$	$16.40$
PIXIE (hand+face) [23]	O	$26.3$	$-$	$-$	$7.04$	$75.9$	$75.1$	$75.5$	$-$
PIXIE (whole-body) [23]	R	$39.7$	$-$	$-$	$0.11$	$97.1$	$51.8$	$67.6$	$\underline{\textbf{0.070}}$
METRO* (hand+face) [63]	R	$11.8$	$15.4$	$11.9$	$0.08$	$80.7$	$54.8$	$65.2$	$0.103$
DICE (Ours)	R	$\underline{\textbf{8.32}}$	$\underline{\textbf{9.95}}$	$\underline{\textbf{7.27}}$	$0.16$	$66.6$	$79.9$	$\underline{72.7}$	$0.088$

* parametric version. O and R denote optimization-based and regression-based methods, respectively. ${\ddagger}$ calculated after translating the center of the head to the origin. Bold and underline denote the overall best and the best among regression-based approaches, respectively. Note that our method operates at an interactive rate (20 fps; 0.049s per image) on an Nvidia 4090 GPU. Here we report the runtime performance on a single A6000 GPU for a fair comparison.

Reconstruction Accuracy. In Tab. 1, our method surpasses all baseline methods in terms of reconstruction accuracy, achieving a $7.5\%$ reduction in per-vertex error compared to the current state-of-the-art, Decaf. Note that our method is regression-based and allows inference at an interactive rate, while Decaf [97] uses a cumbersome test-time optimization process that takes more than $700\times$ longer per image. Decaf also requires temporal information from successive frames, while our method only uses a single frame. Our method shows a $30\%$ reduction in reconstruction error compared to the modified METRO baseline, and up to a $79\%$ reduction compared to other end-to-end baselines.

Plausibility. In addition to high accuracy, our method achieves the highest overall physical plausibility (F-Score) among all regression-based methods. Note that the Touchness and Non-Collision ratios are complementary and are not meaningful when considered individually, while the F-Score measures the two jointly. Our method has a much lower interpenetration distance (Col. Dist.) than Benchmark and PIXIE (hand+face), which treat the hand and face separately and therefore generate implausible interactions. Note that PIXIE (whole-body) and METRO* show lower collision distances with a much lower Touchness than our method, indicating that the reconstructed hands and faces often appear incorrectly as if they are not interacting. In contrast, our method shows a low collision distance with a high Touchness, indicating plausible hand-face interaction reconstruction.

Contact Estimation. In Tab. 2, DICE achieves superior contact estimation performance on the Decaf dataset, surpassing previous work [97] in F-Score for both face and hand contacts. Here, the F-score provides a combined measure of precision and recall. These two metrics are complementary and less meaningful when considered individually; see Fig. 3 for qualitative results.

Table 2: Comparison of hand-face contact estimation on Decaf.

Method	F-score $\uparrow$	Precision $\uparrow$	Recall $\uparrow$	Accuracy $\uparrow$
Decaf (face) [97]	$0.57$	$0.69$	$0.49$	$0.99$
Decaf (hand) [97]	$0.47$	$0.62$	$0.39$	$0.98$
DICE (face)	0.61	$0.64$	$0.57$	$1.00$
DICE (hand)	0.50	$0.55$	$0.45$	$0.98$
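The contact metrics in Tab. 2 can be computed from per-vertex predictions as in the sketch below. The function name and the 0.5 probability threshold are assumptions; the text does not state the threshold used for evaluation.

```python
import numpy as np


def contact_metrics(pred_prob, gt_label, threshold=0.5):
    """Precision, recall, F-score and accuracy for per-vertex contact labels."""
    pred = np.asarray(pred_prob) > threshold
    gt = np.asarray(gt_label).astype(bool)
    tp = int(np.sum(pred & gt))    # predicted contact, truly in contact
    fp = int(np.sum(pred & ~gt))   # predicted contact, not in contact
    fn = int(np.sum(~pred & gt))   # missed contact
    tn = int(np.sum(~pred & ~gt))  # correctly predicted no contact
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2.0 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / gt.size
    return {
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "accuracy": accuracy,
    }
```

One likely reason accuracy is close to 1 for all methods while the F-scores are much lower is class imbalance: if only a small fraction of vertices is in contact, correctly labeling the many non-contact vertices dominates accuracy, whereas precision and recall only look at the contact class.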

4.3.2 Qualitative Evaluations

As discussed in Sec. 3.2, the Decaf [97] dataset is collected in an indoor environment with a green screen, which does not reflect the complex environments where real-world hand-face interactions occur. Therefore, a model trained only with the Decaf dataset might have generalization issues when tested on in-the-wild data. Fig. 4 confirms this: our model shows superior generalization performance on in-the-wild data with unseen identities and poses. As shown in Fig. 3, our method faithfully reconstructs hand-face interaction and deformation and accurately labels the area of contact.

4.4 Ablation Study

In-the-wild data As shown in Tab. 3, adding weak-supervision training and in-the-wild data for DICE training improves all reconstruction error metrics (PVE*, MPJPE, PAMPJPE) while maintaining a high plausibility (F-Score). This is because the limited pose and identity distribution of the Decaf training dataset may cause the model to overfit, and the inclusion of in-the-wild images out of the Decaf data distribution effectively improves the generalization capability of DICE.

Depth Supervision Although depth supervision is only applied to in-the-wild data, as shown in Tab. 3, removing it also significantly affects performance on the Decaf validation set. Without depth loss, wrong predictions in depth are not penalized for in-the-wild data, introducing noise in the training process, and resulting in erroneous depth predictions in the Decaf dataset. As shown in Appendix Fig. 7, the absence of depth supervision introduces ambiguity in the z-direction, resulting in artifacts such as self-collision.

Adversarial Prior The adversarial prior incorporates diverse but realistic pose and shape distribution beyond Decaf [97], ensuring the reality of regressed mesh while allowing for generalization. As shown in Tab. 3, introducing adversarial supervision improves the accuracy and physical plausibility.

Parameter Supervision Supervising parameters directly, in addition to the indirect supervision of parameters by the mesh losses, improves both plausibility and accuracy. This is because direct parameter supervision eliminates ambiguity, without which the network may resort to other parameter combinations that produce incorrect meshes similar in terms of Euclidean distance.

Intermediate Supervision Removing the constraint that $\mathbf{V}_{F}^{\prime},\mathbf{V}_{H}^{\prime}$ are rough meshes and treating them only as feature maps results in a substantial drop in accuracy and a slight drop in plausibility (F-Score). There is also a sharp increase in collision distance, which we attribute to the spatial inaccuracy of the final output mesh. This indicates that using an intermediate mesh feature instead of an ordinary feature map increases the spatial accuracy of the output meshes, which also benefits plausibility.

Network Design In Tab. 3, adopting the two-branch architecture, which separates deformation and interaction estimation from mesh vertices regression, improves both accuracy and plausibility.
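To make the ablation gaps concrete, the following sketch computes the relative increase in PVE* of each ablated variant over the full model, using values transcribed from Tab. 3. The helper name `relative_increase` and the dictionary layout are illustrative and not part of any released DICE code.

```python
# PVE* values transcribed from Tab. 3 (lower is better).
pve = {
    "single branch": 9.29,
    "w.o. in-the-wild data": 8.93,
    "w.o. L_depth": 15.6,
    "w.o. L_params": 10.3,
    "w.o. L_adv": 11.1,
}
PVE_FULL = 8.32  # DICE (Full)

def relative_increase(ablated, full):
    """Percent increase in error of an ablated variant relative to the full model."""
    return 100.0 * (ablated - full) / full

rel = {name: relative_increase(value, PVE_FULL) for name, value in pve.items()}

# Print variants from most to least damaging.
for name, pct in sorted(rel.items(), key=lambda kv: -kv[1]):
    print(f"{name:>24s}: +{pct:.1f}% PVE*")
```

Under this measure, every ablation increases PVE*, and removing the depth loss is the most damaging of the listed variants.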

4.5 Limitations and Future Works

While our method achieves SotA accuracy on the Decaf [97] dataset and generalizes well to unseen scenes and in-the-wild cases, it still fails when the hand-face interactions are extremely challenging and involve severe occlusions (see Appendix D.2). Moreover, while our method effectively recovers hand and face meshes with visually plausible deformations, there remains room for improvement in deformation accuracy and physical plausibility. In the future, physics-based simulation [37, 56, 38, 31, 60, 41] can be used as a stronger prior, producing more physically accurate estimations. Although we found that using 500 in-the-wild images significantly improves the model’s generalization ability, scaling up to a larger amount of in-the-wild data, on the order of millions or billions of images, could further enhance performance, which we will study in future work.

Table 3: Comparison of hand-face interaction and deformation recovery on Decaf. Bold denotes the best result.

| | 3D Reconstruction Error | | | Physics Plausibility Metrics | | | |
|---|---|---|---|---|---|---|---|
| Methods | PVE* $\downarrow$ | MPJPE $\downarrow$ | PAMPJPE $\downarrow$ | Col. Dist. $\downarrow$ | Non. Col. $\uparrow$ | Touchness $\uparrow$ | F-Score $\uparrow$ |
| DICE (single branch) | 9.29 | 11.6 | 8.51 | 0.04 | 87.4 | 57.4 | 69.3 |
| DICE (w.o. in-the-wild data) | 8.93 | 11.0 | 7.50 | 0.11 | 74.6 | 71.9 | **73.3** |
| DICE (w.o. $\mathcal{L}_{\text{depth}}$) | 15.6 | 19.5 | 13.7 | 0.64 | 58.6 | 71.1 | 64.2 |
| DICE (w.o. $\mathcal{L}_{\text{params}}$) | 10.3 | 12.8 | 10.4 | 0.08 | 80.9 | 53.9 | 64.7 |
| DICE (w.o. $\mathcal{L}_{\text{adv}}$) | 11.1 | 14.2 | 10.4 | 0.05 | 82.1 | 60.7 | 69.8 |
| DICE (Full) | **8.32** | **9.95** | **7.27** | 0.16 | 66.6 | 79.9 | 72.7 |
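The F-Score column in Tab. 3 appears consistent with the harmonic mean of the Non. Col. and Touchness percentages. The sketch below checks this on values transcribed from the table. The harmonic-mean definition is an assumption inferred from the numbers rather than something stated in this section, and `harmonic_mean` and `rows` are illustrative names.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two percentages (assumed F-Score definition)."""
    return 2.0 * a * b / (a + b)

# (Non. Col., Touchness, reported F-Score), transcribed from Tab. 3.
rows = {
    "DICE (single branch)":         (87.4, 57.4, 69.3),
    "DICE (w.o. in-the-wild data)": (74.6, 71.9, 73.3),
    "DICE (w.o. L_depth)":          (58.6, 71.1, 64.2),
    "DICE (w.o. L_params)":         (80.9, 53.9, 64.7),
    "DICE (w.o. L_adv)":            (82.1, 60.7, 69.8),
    "DICE (Full)":                  (66.6, 79.9, 72.7),
}

# Absolute gap between the recomputed and the reported F-Score per row.
gaps = {
    name: abs(harmonic_mean(non_col, touch) - reported)
    for name, (non_col, touch, reported) in rows.items()
}

for name, gap in gaps.items():
    print(f"{name:<30s} |recomputed - reported| = {gap:.3f}")
```

Small gaps are expected because the table entries are rounded. Read this way, the F-Score rewards variants that balance non-collision and touchness rather than maximizing only one of them.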

5 Conclusion

In this work, we present DICE, the first end-to-end approach for reconstructing 3D hand and face interaction with deformation from monocular images. Our approach features a two-branch transformer structure, MeshNet and InteractionNet, to model the local deformation field and the global mesh geometry. An inverse-kinematics model, IKNet, is used to output the animatable parametric hand and face meshes. We also propose a novel weak-supervision training pipeline that uses a small number of in-the-wild images, supervised with a depth prior and an adversarial loss that provides pose priors. Benefiting from our network design and training scheme, DICE demonstrates state-of-the-art accuracy and plausibility compared with all previous methods. Meanwhile, our method achieves a fast inference speed (20 fps), enabling more downstream interactive applications. In addition to strong performance on the standard benchmark, DICE also achieves superior generalization performance on in-the-wild data.

References

[1] Aldrian, O., Smith, W.A.: Inverse rendering of faces with a 3d morphable model. IEEE transactions on pattern analysis and machine intelligence 35(5), 1080–1093 (2012)
[2] Bogo, F., Kanazawa, A., Lassner, C., Gehler, P., Romero, J., Black, M.J.: Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In: European conference on computer vision. pp. 561–578. Springer (2016)
[3] Bulat, A., Tzimiropoulos, G.: How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In: International Conference on Computer Vision (2017)
[4] Cai, Z., Ren, D., Zeng, A., Lin, Z., Yu, T., Wang, W., Fan, X., Gao, Y., Yu, Y., Pan, L., et al.: Humman: Multi-modal 4d human dataset for versatile sensing and modeling. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part VII. pp. 557–577. Springer (2022)
[5] Cai, Z., Yin, W., Zeng, A., Wei, C., Sun, Q., Yanjun, W., Pang, H.E., Mei, H., Zhang, M., Zhang, L., et al.: Smpler-x: Scaling up expressive human pose and shape estimation. Advances in Neural Information Processing Systems 36 (2024)
[6] Cao, Z., Radosavovic, I., Kanazawa, A., Malik, J.: Reconstructing hand-object interactions in the wild. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12417–12426 (2021)
[7] Chai, Z., Zhang, T., He, T., Tan, X., Baltrusaitis, T., Wu, H., Li, R., Zhao, S., Yuan, C., Bian, J.: Hiface: High-fidelity 3d face reconstruction by learning static and dynamic details. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9087–9098 (2023)
[8] Chatziagapi, A., Samaras, D.: Avface: Towards detailed audio-visual 4d face reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16878–16889 (2023)
[9] Chen, J., Yan, M., Zhang, J., Xu, Y., Li, X., Weng, Y., Yi, L., Song, S., Wang, H.: Tracking and reconstructing hand object interactions from point cloud sequences in the wild. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 304–312 (2023)
[10] Chen, Y., Tu, Z., Kang, D., Chen, R., Bao, L., Zhang, Z., Yuan, J.: Joint hand-object 3d reconstruction from a single image with cross-branch feature fusion. IEEE Transactions on Image Processing 30, 4008–4021 (2021)
[11] Cho, J., Youwang, K., Oh, T.H.: Cross-attention of disentangled modalities for 3d human mesh recovery with transformers. In: European Conference on Computer Vision. pp. 342–359. Springer (2022)
[12] Clever, H.M., Grady, P.L., Turk, G., Kemp, C.C.: Bodypressure-inferring body pose and contact pressure from a depth image. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(1), 137–153 (2022)
[13] Cong, P., Dou, Z.W., Ren, Y., Yin, W., Cheng, K., Sun, Y., Long, X., Zhu, X., Ma, Y.: Laserhuman: Language-guided scene-aware human motion generation in free environment. arXiv preprint arXiv:2403.13307 (2024)
[14] Contributors, M.: Openmmlab 3d human parametric model toolbox and benchmark
[15] Corona, E., Pumarola, A., Alenya, G., Moreno-Noguer, F., Rogez, G.: Ganhand: Predicting human grasp affordances in multi-object scenes. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5031–5041 (2020)
[16] Daněček, R., Black, M.J., Bolkart, T.: Emoca: Emotion driven monocular face capture and animation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20311–20322 (2022)
[17] Dou, Z., Chen, X., Fan, Q., Komura, T., Wang, W.: C·ase: Learning conditional adversarial skill embeddings for physics-based characters. In: SIGGRAPH Asia 2023 Conference Papers. pp. 1–11 (2023)
[18] Dou, Z., Wu, Q., Lin, C., Cao, Z., Wu, Q., Wan, W., Komura, T., Wang, W.: Tore: Token reduction for efficient human mesh recovery with transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15143–15155 (2023)
[19] Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. Advances in neural information processing systems 27 (2014)
[20] Fan, Z., Parelli, M., Kadoglou, M.E., Kocabas, M., Chen, X., Black, M.J., Hilliges, O.: HOLD: Category-agnostic 3d reconstruction of interacting hands and objects from video. In: Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
[21] Fan, Z., Taheri, O., Tzionas, D., Kocabas, M., Kaufmann, M., Black, M.J., Hilliges, O.: ARCTIC: A dataset for dexterous bimanual hand-object manipulation. In: Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2023)
[22] Fang, Q., Shuai, Q., Dong, J., Bao, H., Zhou, X.: Reconstructing 3d human pose by watching humans in the mirror. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12814–12823 (2021)
[23] Feng, Y., Choutas, V., Bolkart, T., Tzionas, D., Black, M.J.: Collaborative regression of expressive bodies using moderation. In: 2021 International Conference on 3D Vision (3DV). pp. 792–804 (2021)
[24] Feng, Y., Feng, H., Black, M.J., Bolkart, T.: Learning an animatable detailed 3d face model from in-the-wild images. ACM Transactions on Graphics (ToG) 40(4), 1–13 (2021)
[25] Feng, Y., Lin, J., Dwivedi, S.K., Sun, Y., Patel, P., Black, M.J.: Posegpt: Chatting about 3d human pose. arXiv preprint arXiv:2311.18836 (2023)
[26] Feng, Y., Wu, F., Shao, X., Wang, Y., Zhou, X.: Joint 3d face reconstruction and dense alignment with position map regression network. In: Proceedings of the European conference on computer vision (ECCV). pp. 534–551 (2018)
[27] Grady, P., Tang, C., Twigg, C.D., Vo, M., Brahmbhatt, S., Kemp, C.C.: Contactopt: Optimizing contact to improve grasps. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1471–1481 (2021)
[28] Guo, C., Zou, S., Zuo, X., Wang, S., Ji, W., Li, X., Cheng, L.: Generating diverse and natural 3d human motions from text. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5152–5161 (2022)
[29] Guo, Y., Dou, Z., Zhang, N., Liu, X., Su, B., Li, Y., Zhang, Y.: Student close contact behavior and covid-19 transmission in china’s classrooms. PNAS nexus 2(5), pgad142 (2023)
[30] Hampali, S., Rad, M., Oberweger, M., Lepetit, V.: Honnotate: A method for 3d annotation of hand and object poses. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3196–3206 (2020)
[31] Han, X., Gast, T.F., Guo, Q., Wang, S., Jiang, C., Teran, J.: A hybrid material point method for frictional contact with diverse materials. Proceedings of the ACM on Computer Graphics and Interactive Techniques 2(2) (2019). https://doi.org/10.1145/3340258
[32] Hassan, M., Choutas, V., Tzionas, D., Black, M.J.: Resolving 3d human pose ambiguities with 3d scene constraints. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 2282–2292 (2019)
[33] Hassan, M., Ghosh, P., Tesch, J., Tzionas, D., Black, M.J.: Populating 3d scenes by learning human-scene interaction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14708–14718 (2021)
[34] Hasson, Y., Varol, G., Tzionas, D., Kalevatykh, I., Black, M.J., Laptev, I., Schmid, C.: Learning joint reconstruction of hands and manipulated objects. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11807–11816 (2019)
[35] He, S., He, H., Yang, S., Wu, X., Xia, P., Yin, B., Liu, C., Dai, L., Xu, C.: Speech4mesh: Speech-assisted monocular 3d facial reconstruction for speech-driven 3d facial animation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14192–14202 (2023)
[36] Hu, L., Saito, S., Wei, L., Nagano, K., Seo, J., Fursund, J., Sadeghi, I., Sun, C., Chen, Y.C., Li, H.: Avatar digitization from a single image for real-time rendering. ACM Transactions on Graphics (ToG) 36(6), 1–14 (2017)
[37] Hu, Y., Fang, Y., Ge, Z., Qu, Z., Zhu, Y., Pradhana, A., Jiang, C.: A moving least squares material point method with displacement discontinuity and two-way rigid body coupling. ACM Transactions on Graphics (TOG) 37(4), 1–14 (2018)
[38] Hu, Y., Li, T.M., Anderson, L., Ragan-Kelley, J., Durand, F.: Taichi: a language for high-performance computation on spatially sparse data structures. ACM Transactions on Graphics (TOG) 38(6), 201 (2019)
[39] Huang, B., Pan, L., Yang, Y., Ju, J., Wang, Y.: Neural mocon: Neural motion control for physically plausible human motion capture. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6417–6426 (2022)
[40] Huang, C.H.P., Yi, H., Höschle, M., Safroshkin, M., Alexiadis, T., Polikovsky, S., Scharstein, D., Black, M.J.: Capturing and inferring dense full-body human-scene contact. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13274–13285 (2022)
[41] Huang, K., Chitalu, F.M., Lin, H., Komura, T.: Gipc: Fast and stable gauss-newton optimization of ipc barrier energy. ACM Transactions on Graphics (TOG) 43(2) (2024). https://doi.org/10.1145/3643028
[42] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International conference on machine learning. pp. 448–456. pmlr (2015)
[43] Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7122–7131 (2018)
[44] Ke, B., Obukhov, A., Huang, S., Metzger, N., Daudt, R.C., Schindler, K.: Repurposing diffusion-based image generators for monocular depth estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
[45] Kocabas, M., Huang, C.H.P., Tesch, J., Müller, L., Hilliges, O., Black, M.J.: Spec: Seeing people in the wild with an estimated camera. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11035–11045 (2021)
[46] Kolotouros, N., Pavlakos, G., Black, M.J., Daniilidis, K.: Learning to reconstruct 3d human pose and shape via model-fitting in the loop. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 2252–2261 (2019)
[47] Kolotouros, N., Pavlakos, G., Daniilidis, K.: Convolutional mesh regression for single-image human shape reconstruction. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4501–4510 (2019)
[48] Kumar, R., Luo, J., Pang, A., Davis, J.: Disjoint pose and shape for 3d face reconstruction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3115–3125 (2023)
[49] Lassner, C., Romero, J., Kiefel, M., Bogo, F., Black, M.J., Gehler, P.V.: Unite the people: Closing the loop between 3d and 2d human representations. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6050–6059 (2017)
[50] Li, C., Morel-Forster, A., Vetter, T., Egger, B., Kortylewski, A.: Robust model-based face reconstruction through weakly-supervised outlier segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 372–381 (2023)
[51] Li, J., Bian, S., Liu, Q., Tang, J., Wang, F., Lu, C.: Niki: Neural inverse kinematics with invertible neural networks for 3d human pose and shape estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12933–12942 (2023)
[52] Li, J., Bian, S., Xu, C., Chen, Z., Yang, L., Lu, C.: Hybrik-x: Hybrid analytical-neural inverse kinematics for whole-body mesh recovery. arXiv preprint arXiv:2304.05690 (2023)
[53] Li, J., Xu, C., Chen, Z., Bian, S., Yang, L., Lu, C.: Hybrik: A hybrid analytical-neural inverse kinematics solution for 3d human pose and shape estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3383–3393 (2021)
[54] Li, K., Yang, L., Zhen, H., Lin, Z., Zhan, X., Zhong, L., Xu, J., Wu, K., Lu, C.: Chord: Category-level hand-held object reconstruction via shape deformation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9444–9454 (2023)
[55] Li, M., An, L., Zhang, H., Wu, L., Chen, F., Yu, T., Liu, Y.: Interacting attention graph for single image two-hand reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2761–2770 (2022)
[56] Li, M., Ferguson, Z., Schneider, T., Langlois, T., Zorin, D., Panozzo, D., Jiang, C., Kaufman, D.M.: Incremental potential contact: intersection- and inversion-free, large-deformation dynamics. ACM Transactions on Graphics (TOG) 39(4) (2020). https://doi.org/10.1145/3386569.3392425
[57] Li, T., Bolkart, T., Black, M.J., Li, H., Romero, J.: Learning a model of facial shape and expression from 4d scans. ACM Trans. Graph. 36(6), 194–1 (2017)
[58] Li, Z., Shimada, S., Schiele, B., Theobalt, C., Golyanik, V.: Mocapdeform: Monocular 3d human motion capture in deformable scenes. In: 2022 International Conference on 3D Vision (3DV). pp. 1–11. IEEE (2022)
[59] Li, Z., Liu, J., Zhang, Z., Xu, S., Yan, Y.: Cliff: Carrying location information in full frames into human pose and shape estimation. In: European Conference on Computer Vision. pp. 590–606. Springer (2022)
[60] Lin, H., Chitalu, F.M., Komura, T.: Isotropic arap energy using cauchy-green invariants. ACM Transactions on Graphics (TOG) 41(6) (2022). https://doi.org/10.1145/3550454.3555507
[61] Lin, J., Zeng, A., Lu, S., Cai, Y., Zhang, R., Wang, H., Zhang, L.: Motion-x: A large-scale 3d expressive whole-body human motion dataset. Advances in Neural Information Processing Systems 36 (2024)
[62] Lin, J., Zeng, A., Wang, H., Zhang, L., Li, Y.: One-stage 3d whole-body mesh recovery with component aware transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21159–21168 (2023)
[63] Lin, K., Wang, L., Liu, Z.: End-to-end human pose and mesh reconstruction with transformers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 1954–1963 (2021)
[64] Lin, K., Wang, L., Liu, Z.: Mesh graphormer. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 12939–12948 (2021)
[65] Liu, S., Jiang, H., Xu, J., Liu, S., Wang, X.: Semi-supervised 3d hand-object poses estimation with interactions in time. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14687–14697 (2021)
[66] Liu, X., Dou, Z., Wang, L., Su, B., Jin, T., Guo, Y., Wei, J., Zhang, N.: Close contact behavior-based covid-19 transmission and interventions in a subway system. Journal of Hazardous Materials 436, 129233 (2022)
[67] Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: Smpl: A skinned multi-person linear model. In: Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pp. 851–866 (2023)
[68] Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., Zhang, F., Chang, C.L., Yong, M., Lee, J., et al.: Mediapipe: A framework for perceiving and processing reality. In: Third workshop on computer vision for AR/VR at IEEE computer vision and pattern recognition (CVPR). vol. 2019 (2019)
[69] Luo, Z., Iwase, S., Yuan, Y., Kitani, K.: Embodied scene-aware human pose estimation. Advances in Neural Information Processing Systems 35, 6815–6828 (2022)
[70] Moon, G.: Bringing inputs to shared domains for 3d interacting hands recovery in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17028–17037 (2023)
[71] Moon, G., Lee, K.M.: I2l-meshnet: Image-to-lixel prediction network for accurate 3d human pose and mesh estimation from a single rgb image. In: European Conference on Computer Vision. pp. 752–768. Springer (2020)
[72] Moon, G., Saito, S., Xu, W., Joshi, R., Buffalini, J., Bellan, H., Rosen, N., Richardson, J., Mize, M., De Bree, P., et al.: A dataset of relighted 3d interacting hands. Advances in Neural Information Processing Systems 36 (2024)
[73] Moon, G., Yu, S.I., Wen, H., Shiratori, T., Lee, K.M.: Interhand2.6m: A dataset and baseline for 3d interacting hand pose estimation from a single rgb image. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16. pp. 548–564. Springer (2020)
[74] Mueller, F., Davis, M., Bernard, F., Sotnychenko, O., Verschoor, M., Otaduy, M.A., Casas, D., Theobalt, C.: Real-time pose and shape reconstruction of two interacting hands with a single depth camera. ACM Transactions on Graphics (ToG) 38(4), 1–13 (2019)
[75] Muller, L., Osman, A.A., Tang, S., Huang, C.H.P., Black, M.J.: On self-contact and human pose. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9990–9999 (2021)
[76] Oh, Y., Park, J., Kim, J., Moon, G., Lee, K.M.: Recovering 3d hand mesh sequence from a single blurry image: A new dataset and temporal unfolding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 554–563 (2023)
[77] Oikonomidis, I., Kyriazis, N., Argyros, A.A.: Tracking the articulated motion of two strongly interacting hands. In: 2012 IEEE conference on computer vision and pattern recognition. pp. 1862–1869. IEEE (2012)
[78] Otto, C., Chandran, P., Zoss, G., Gross, M., Gotardo, P., Bradley, D.: A perceptual shape loss for monocular 3d face reconstruction. In: Computer Graphics Forum. vol. 42, p. e14945. Wiley Online Library (2023)
[79] Pan, D., Zhuo, L., Piao, J., Luo, H., Cheng, W., Wang, Y., Fan, S., Liu, S., Yang, L., Dai, B., Liu, Z., Loy, C.C., Qian, C., Wu, W., Lin, D., Lin, K.Y.: Renderme-360: Large digital asset library and benchmark towards high-fidelity head avatars. In: Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track (2023)
[80] Pan, L., Wang, J., Huang, B., Zhang, J., Wang, H., Tang, X., Wang, Y.: Synthesizing physically plausible human motions in 3d scenes. arXiv preprint arXiv:2308.09036 (2023)
[81] Park, J., Oh, Y., Moon, G., Choi, H., Lee, K.M.: Handoccnet: Occlusion-robust 3d hand mesh estimation network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1496–1505 (2022)
[82] Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A., Tzionas, D., Black, M.J.: Expressive body capture: 3d hands, face, and body from a single image. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10975–10985 (2019)
[83] Peng, X.B., Guo, Y., Halper, L., Levine, S., Fidler, S.: Ase: Large-scale reusable adversarial skill embeddings for physically simulated characters. ACM Transactions On Graphics (TOG) 41(4), 1–17 (2022)
[84] Peng, X.B., Ma, Z., Abbeel, P., Levine, S., Kanazawa, A.: Amp: Adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics (ToG) 40(4), 1–20 (2021)
[85] Pham, T.H., Kyriazis, N., Argyros, A.A., Kheddar, A.: Hand-object contact force estimation from markerless visual tracking. IEEE transactions on pattern analysis and machine intelligence 40(12), 2883–2896 (2017)
[86] Pokhariya, C., Shah, I.N., Xing, A., Li, Z., Chen, K., Sharma, A., Sridhar, S.: Manus: Markerless hand-object grasp capture using articulated 3d gaussians. arXiv preprint arXiv:2312.02137 (2023)
[87] Pumarola, A., Agudo, A., Martinez, A.M., Sanfeliu, A., Moreno-Noguer, F.: Ganimation: Anatomically-aware facial animation from a single image. In: Proceedings of the European conference on computer vision (ECCV). pp. 818–833 (2018)
[88] Qin, D., Saito, J., Aigerman, N., Groueix, T., Komura, T.: Neural face rigging for animating and retargeting facial meshes in the wild. In: ACM SIGGRAPH 2023 Conference Proceedings. pp. 1–11 (2023)
[89] Ravi, N., Reizenstein, J., Novotny, D., Gordon, T., Lo, W.Y., Johnson, J., Gkioxari, G.: Accelerating 3d deep learning with pytorch3d. arXiv preprint arXiv:2007.08501 (2020)
[90] Rempe, D., Birdal, T., Hertzmann, A., Yang, J., Sridhar, S., Guibas, L.J.: Humor: 3d human motion model for robust pose estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 11488–11499 (October 2021)
[91] Rempe, D., Guibas, L.J., Hertzmann, A., Russell, B., Villegas, R., Yang, J.: Contact and human dynamics from monocular video. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16. pp. 71–87. Springer (2020)
[92] Romero, J., Tzionas, D., Black, M.J.: Embodied hands: Modeling and capturing hands and bodies together. arXiv preprint arXiv:2201.02610 (2022)
[93] Rong, Y., Wang, J., Liu, Z., Loy, C.C.: Monocular 3d reconstruction of interacting hands via collision-aware factorized refinements. In: 2021 International Conference on 3D Vision (3DV). pp. 432–441. IEEE (2021)
[94] Sanyal, S., Bolkart, T., Feng, H., Black, M.J.: Learning to regress 3d face shape and expression from an image without 3d supervision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7763–7772 (2019)
[95] Shi, M., Starke, S., Ye, Y., Komura, T., Won, J.: Phasemp: Robust 3d pose estimation via phase-conditioned human motion prior. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14725–14737 (2023)
[96] Shimada, S., Golyanik, V., Li, Z., Pérez, P., Xu, W., Theobalt, C.: Hulc: 3d human motion capture with pose manifold sampling and dense contact guidance. In: European Conference on Computer Vision. pp. 516–533. Springer (2022)
[97] Shimada, S., Golyanik, V., Pérez, P., Theobalt, C.: Decaf: Monocular deformation capture for face and hand interactions. ACM Transactions on Graphics (TOG) 42(6), 1–16 (2023)
[98] Spille, J.L., Grunwald, M., Martin, S., Mueller, S.M.: Stop touching your face! a systematic review of triggers, characteristics, regulatory functions and neuro-physiology of facial self-touch. Neuroscience & Biobehavioral Reviews 128, 102–116 (Sep 2021). https://doi.org/10.1016/j.neubiorev.2021.05.030
[99] Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 5693–5703 (2019)
[100] Tekin, B., Bogo, F., Pollefeys, M.: H+o: Unified egocentric recognition of 3d hand-object poses and interactions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4511–4520 (2019)
[101] Tevet, G., Raab, S., Gordon, B., Shafir, Y., Cohen-Or, D., Bermano, A.H.: Human motion diffusion model. arXiv preprint arXiv:2209.14916 (2022)
[102] Thies, J., Zollhofer, M., Stamminger, M., Theobalt, C., Nießner, M.: Face2face: Real-time face capture and reenactment of rgb videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2387–2395 (2016)
[103] Tretschk, E., Kairanda, N., BR, M., Dabral, R., Kortylewski, A., Egger, B., Habermann, M., Fua, P., Theobalt, C., Golyanik, V.: State of the art in dense monocular non-rigid 3d reconstruction. In: Computer Graphics Forum. vol. 42, pp. 485–520. Wiley Online Library (2023)
[104] Tsoli, A., Argyros, A.A.: Joint 3d tracking of a deformable object in interaction with a hand. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 484–500 (2018)
[105] Wan, W., Dou, Z., Komura, T., Wang, W., Jayaraman, D., Liu, L.: Tlcontrol: Trajectory and language control for human motion synthesis. arXiv preprint arXiv:2311.17135 (2023)
[106] Wan, W., Huang, Y., Wu, S., Komura, T., Wang, W., Jayaraman, D., Liu, L.: Diffusionphase: Motion diffusion in frequency domain. arXiv preprint arXiv:2312.04036 (2023)
[107] Wang, J., Mueller, F., Bernard, F., Sorli, S., Sotnychenko, O., Qian, N., Otaduy, M.A., Casas, D., Theobalt, C.: Rgb2hands: real-time tracking of 3d hand interactions from monocular rgb video. ACM Transactions on Graphics (ToG) 39(6), 1–16 (2020)
[108] Wang, J., Rong, Y., Liu, J., Yan, S., Lin, D., Dai, B.: Towards diverse and natural scene-aware 3d human motion synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20460–20469 (2022)
[109] Wang, L., Chen, Z., Yu, T., Ma, C., Li, L., Liu, Y.: Faceverse: a fine-grained and detail-controllable 3d face morphable model from a hybrid dataset. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20333–20342 (2022)
[110] Wang, W., Ge, Y., Mei, H., Cai, Z., Sun, Q., Wang, Y., Shen, C., Yang, L., Komura, T.: Zolly: Zoom focal length correctly for perspective-distorted human mesh reconstruction. arXiv preprint arXiv:2303.13796 (2023)
[111] Wang, Y., Sun, Q., Wang, W., Ling, J., Cai, Z., Xie, R., Song, L.: Learning dense uv completion for human mesh recovery. arXiv preprint arXiv:2307.11074 (2023)
[112] Wang, Y., Daniilidis, K.: Refit: Recurrent fitting network for 3d human recovery. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14644–14654 (2023)
[113] Wei, S.E., Saragih, J., Simon, T., Harley, A.W., Lombardi, S., Perdoch, M., Hypes, A., Wang, D., Badino, H., Sheikh, Y.: Vr facial animation via multiview image translation. ACM Transactions on Graphics (TOG) 38(4), 1–16 (2019)
[114] Weng, Z., Yeung, S.: Holistic 3d human and scene mesh estimation from single view images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 334–343 (2021)
[115] Wood, E., Baltrušaitis, T., Hewitt, C., Johnson, M., Shen, J., Milosavljević, N., Wilde, D., Garbin, S., Sharp, T., Stojiljković, I., et al.: 3d face reconstruction with dense landmarks. In: European Conference on Computer Vision. pp. 160–177. Springer (2022)
[116] Xie, X., Bhatnagar, B.L., Pons-Moll, G.: Chore: Contact, human and object reconstruction from a single rgb image. In: European Conference on Computer Vision. pp. 125–145. Springer (2022)
[117] Xu, S., Li, Z., Wang, Y.X., Gui, L.Y.: Interdiff: Generating 3d human-object interactions with physics-informed diffusion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14928–14940 (2023)
[118] Xu, S., Wang, Z., Wang, Y.X., Gui, L.Y.: Interdreamer: Zero-shot text to 3d dynamic human-object interaction. arXiv preprint arXiv:2403.19652 (2024)
[119] Yang, L., Li, K., Zhan, X., Lv, J., Xu, W., Li, J., Lu, C.: Artiboost: Boosting articulated 3d hand-object pose estimation via online exploration and synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2750–2760 (2022)
[120] Yang, L., Li, K., Zhan, X., Wu, F., Xu, A., Liu, L., Lu, C.: Oakink: A large-scale knowledge repository for understanding hand-object interaction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20953–20962 (2022)
[121] Yang, L., Zhan, X., Li, K., Xu, W., Li, J., Lu, C.: Cpf: Learning a contact potential field to model the hand-object interaction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11097–11106 (2021)
[122] Yang, X., Dou, Z., Ding, Y., Su, B., Qian, H., Zhang, N.: Analysis of sars-cov-2 transmission in airports based on real human close contact behaviors. Journal of Building Engineering 82, 108299 (2024)
[123] Ye, Y., Gupta, A., Tulsiani, S.: What’s in your hands? 3d reconstruction of generic objects in hands. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3895–3905 (2022)
[124] Ye, Y., Hebbar, P., Gupta, A., Tulsiani, S.: Diffusion-guided reconstruction of everyday hand-object interaction clips. In: ICCV (2023)
[125] Ye, Y., Li, X., Gupta, A., Mello, S.D., Birchfield, S., Song, J., Tulsiani, S., Liu, S.: Affordance diffusion: Synthesizing hand-object interactions. In: CVPR (2023)
[126] Yu, Z., Huang, S., Fang, C., Breckon, T.P., Wang, J.: Acr: Attention collaboration-based regressor for arbitrary two-hand reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12955–12964 (2023)
[127] Zhang, B., Wang, Y., Deng, X., Zhang, Y., Tan, P., Ma, C., Wang, H.: Interacting two-hand 3d pose and shape reconstruction from single color image. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11354–11363 (2021)
[128] Zhang, H., Tian, Y., Zhou, X., Ouyang, W., Liu, Y., Wang, L., Sun, Z.: Pymaf: 3d human pose and shape regression with pyramidal mesh alignment feedback loop. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11446–11456 (2021)
[129] Zhang, J.Y., Pepose, S., Joo, H., Ramanan, D., Malik, J., Kanazawa, A.: Perceiving 3d human-object spatial arrangements from a single image in the wild. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XII 16. pp. 34–51. Springer (2020)
[130] Zhang, N., Liu, L., Dou, Z., Liu, X., Yang, X., Miao, D., Guo, Y., Gu, S., Li, Y., Qian, H., et al.: Close contact behaviors of university and school students in 10 indoor environments. Journal of Hazardous Materials 458, 132069 (2023)
[131] Zhang, N., Liu, X., Gao, S., Su, B., Dou, Z.: Popularization of high-speed railway reduces the infection risk via close contact route during journey. Sustainable Cities and Society 99, 104979 (2023)
[132] Zhang, N., Yang, X., Su, B., Dou, Z.: Analysis of sars-cov-2 transmission in a university classroom based on real human close contact behaviors. Science of The Total Environment 917, 170346 (2024)
[133] Zhang, S., Zhang, Y., Bogo, F., Pollefeys, M., Tang, S.: Learning motion priors for 4d human body capture in 3d scenes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11343–11353 (2021)
[134] Zhang, T., Chu, X., Liu, Y., Lin, L., Yang, Z., Xu, Z., Cao, C., Yu, F., Zhou, C., Yuan, C., et al.: Accurate 3d face reconstruction with facial component tokens. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9033–9042 (2023)
[135] Zhang, Y., Hassan, M., Neumann, H., Black, M.J., Tang, S.: Generating 3d people in scenes without people. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6194–6204 (2020)
[136] Zhao, Q., Long, P., Zhang, Q., Qin, D., Liang, H., Zhang, L., Zhang, Y., Yu, J., Xu, L.: Media2face: Co-speech facial animation generation with multi-modality guidance. arXiv preprint arXiv:2401.15687 (2024)
[137] Zheng, Y., Abrevaya, V.F., Bühler, M.C., Chen, X., Black, M.J., Hilliges, O.: Im avatar: Implicit morphable head avatars from videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13545–13555 (2022)
[138] Zhou, W., Dou, Z., Cao, Z., Liao, Z., Wang, J., Wang, W., Liu, Y., Komura, T., Wang, W., Liu, L.: Emdm: Efficient motion diffusion model for fast, high-quality motion generation. arXiv preprint arXiv:2312.02256 (2023)
[139] Zielonka, W., Bolkart, T., Thies, J.: Towards metrical reconstruction of human faces. In: European Conference on Computer Vision. pp. 250–269. Springer (2022)
[140] Zimmermann, C., Ceylan, D., Yang, J., Russell, B., Argus, M., Brox, T.: Freihand: A dataset for markerless capture of hand pose and shape from single rgb images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 813–822 (2019)
[141] Zuo, B., Zhao, Z., Sun, W., Xie, W., Xue, Z., Wang, Y.: Reconstructing interacting hands with interaction prior from monocular images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9054–9064 (2023)

Appendix A Implementation Details

A.1 CNN Backbone

The CNN backbone used in our framework is an HRNet-W64 [99], initialized with ImageNet-pretrained weights. The backbone weights are updated during training. We extract a ( $49\times H$ )-dim feature map from this network and upsample it to a ( $N\times H$ )-dim feature map, where $N=N_{h_{k}}+N_{f_{k}}+N_{h_{v}}+N_{f_{v}}$ is the total number of hand and head keypoints $N_{h_{k}},N_{f_{k}}$ and vertices $N_{h_{v}},N_{f_{v}}$ . Then, we concatenate the keypoints and vertices of the head and hand mean poses as keypoint and vertex queries, resulting in a ( $(N+3)\times H$ )-dim feature map. Following [63], we randomly mask $30\%$ of the keypoint and vertex queries.

A.2 MeshNet and InteractionNet

Our MeshNet and InteractionNet have similar progressive downsampling transformer encoder structures; see Fig. 5 for an illustration. The MeshNet consists of three transformer encoders with decreasing feature dimensions. The InteractionNet starts with a fully connected layer that downsamples the feature dimension, followed by two transformer encoders. Each transformer encoder has a Multi-Head Attention module consisting of 4 layers and 4 attention heads. In addition to head and hand mesh features, MeshNet also regresses head and hand keypoints, which are used only for supervision and not by any downstream component.


Figure 5: Structural details of the MeshNet and InteractionNet. (a) MeshNet; (b) InteractionNet; (c) Internal structure of a Transformer Encoder block.

A.3 IKNet

Our IKNets take in rough mesh features $\mathbf{V}_{F}^{\prime},\mathbf{V}_{H}^{\prime}$ and output the pose and shape parameters $(\theta,\beta)$ , as well as the global rotation and translation $(R,T)$ . They feature a Multi-Layer Perceptron (MLP) structure, each consisting of five MLP Blocks and a final fully connected layer. Each MLP Block contains a fully connected layer, followed by a batch normalization layer [42] and a ReLU activation layer. There are two skip-connections, connecting the output of the first block with the input of the third block, and the output of the third block with the input of the final fully connected layer. See Fig. 6 for an illustration. The hand and head IKNets have the same structure, differing only in their input and output dimensions. The hidden dimensions of the two IKNets are 1024.


Figure 6: Structural details of the IKNet.

A.4 Training and Testing Details

To be consistent with the training setting of Decaf [97] (confirmed by the authors of Decaf), we use all eight camera views and the subjects S2, S4, S5, S7, and S8 from the training split of the Decaf dataset for training. For testing, we use only the front view (view 108) and the subjects S1, S3, and S6 from the testing split. The low-, mid-, and high-resolution head meshes consist of $559$ , $1675$ , and $5023$ vertices, respectively. The low- and high-resolution hand meshes consist of $195$ and $778$ vertices, respectively. We use the mid-resolution head mesh and the high-resolution hand mesh as the inputs of the head and hand IKNets.

Appendix B More Qualitative Comparisons

Fig. 7 shows qualitatively the effect of removing the depth loss. When trained without the depth loss, the network is supervised only with 2D information on in-the-wild data, without any constraint in the z-direction. As a result, artifacts such as self-penetration occur frequently. Introducing the depth loss resolves this ambiguity and allows the hand and face to be positioned correctly relative to each other.


Figure 7: Qualitative demonstration of the effects of the depth loss. The model generalizes poorly in the z-direction when trained without depth supervision.

Appendix C Additional Details on Losses

Here, we provide the details of the mesh losses and the interaction losses. The details of the adversarial loss and the depth loss are already mentioned in the main paper.

C.1 Mesh losses

The mesh loss $\mathcal{L}_{\text{mesh}}$ consists of four components.

\mathcal{L}_{\text{mesh}}=\mathcal{L}_{\text{reproj}}+4\mathcal{L}_{\text{vert}}+2\mathcal{L}_{\text{key}}+2\mathcal{L}_{\text{params}}.

(6)

Vertices Loss $L_{1}$ loss is used for predicted rough 3D face and hand vertices ${\mathbf{V}_{f}^{\prime}}$ , ${\mathbf{V}_{h}^{\prime}}$ , FLAME-regressed undeformed 3D face vertices ${{\mathbf{V}_{f}^{*}}}$ and MANO-regressed 3D hand vertices ${{\mathbf{V}_{h}}}$ against the ground-truth 3D undeformed face vertices $\mathbf{\hat{V}}_{f}$ and 3D hand vertices $\mathbf{\hat{V}}_{h}$ .

\mathcal{L}_{\text{vert}}=\lambda_{h}(\mu_{\text{nonpara}}\|\mathbf{V}_{h}^{\prime}-\mathbf{\hat{V}}_{h}\|_{1}+\|\mathbf{V}_{h}-\mathbf{\hat{V}}_{h}\|_{1})+\lambda_{f}(\mu_{\text{nonpara}}\|\mathbf{V}_{f}^{\prime}-\mathbf{\hat{V}}_{f}\|_{1}+\|\mathbf{V}_{f}^{*}-\mathbf{\hat{V}}_{f}\|_{1}),

(7)

where $\lambda_{h},\lambda_{f}$ are empirically set to $3$ and $1$ respectively. $\mu_{\text{nonpara}}$ is set to $4$ to emphasize the supervision on the more complex non-parametric mesh features.
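
The weighting in Eq. (7) can be sketched in plain Python as follows. This is an illustration only, not the authors' implementation: the function names `l1` and `vertex_loss` are hypothetical, and the mean reduction inside `l1` is an assumption, since the text does not state how the $L_{1}$ norm is reduced. Rough and regressed meshes of each part are compared against that part's own ground truth.

```python
def l1(pred, target):
    """Mean absolute difference over all coordinates (assumed reduction)."""
    diffs = [abs(p - t)
             for v_pred, v_gt in zip(pred, target)
             for p, t in zip(v_pred, v_gt)]
    return sum(diffs) / len(diffs)


def vertex_loss(v_h_rough, v_h_mano, v_h_gt,
                v_f_rough, v_f_flame, v_f_gt,
                lam_h=3.0, lam_f=1.0, mu_nonpara=4.0):
    """Eq. (7): rough (non-parametric) meshes get the extra weight mu_nonpara."""
    hand = mu_nonpara * l1(v_h_rough, v_h_gt) + l1(v_h_mano, v_h_gt)
    face = mu_nonpara * l1(v_f_rough, v_f_gt) + l1(v_f_flame, v_f_gt)
    return lam_h * hand + lam_f * face


# Toy example with a single vertex per mesh (values are made up).
gt = [(0.0, 0.0, 0.0)]
loss = vertex_loss([(1.0, 1.0, 1.0)], gt, gt,
                   [(0.5, 0.5, 0.5)], gt, gt)
print(loss)
```

With these weights, an error on the rough hand mesh counts $3\times 4=12$ times as much as the same error on the FLAME-regressed face mesh.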

Keypoints Loss We use $L_{1}$ loss for predicted rough 3D face and hand keypoints $\mathbf{K}_{f}^{\prime}$ , $\mathbf{K}_{h}^{\prime}$ , 3D face and hand keypoints extracted from rough mesh ${\mathbf{K}_{f_{\text{mesh}}}},{\mathbf{K}_{h_{\text{mesh}}}}$ , FLAME-regressed 3D face keypoints ${\mathbf{K}_{f}}$ and MANO-regressed 3D hand keypoints ${\mathbf{K}_{h}}$ against the ground-truth 3D undeformed face keypoints ${\mathbf{\hat{K}}_{f}}$ and 3D hand keypoints ${\mathbf{\hat{K}}_{h}}$ .

\mathcal{L}_{\text{key}}=\mu_{\text{nonpara}}(\|\mathbf{K}_{h}^{\prime}-\mathbf{\hat{K}}_{h}\|_{1}+\|{\mathbf{K}_{h_{\text{mesh}}}}-\mathbf{\hat{K}}_{h}\|_{1}+\|\mathbf{K}_{f}^{\prime}-\mathbf{\hat{K}}_{f}\|_{1}+\|{\mathbf{K}_{f_{\text{mesh}}}}-\mathbf{\hat{K}}_{f}\|_{1})

(8)

+\|{\mathbf{K}_{f}}-{\mathbf{\hat{K}}_{f}}\|_{1}+\|{\mathbf{K}_{h}}-{\mathbf{\hat{K}}_{h}}\|_{1},

(9)

where $\mu_{\text{nonpara}}$ is empirically set to $4$ , to put more weight on the non-parametric mesh with high degrees of freedom.

Reprojection loss $L_{1}$ loss is used for reprojected rough 3D face and hand keypoints $\mathbf{K}_{f}^{\prime}$ , $\mathbf{K}_{h}^{\prime}$ , 3D face and hand keypoints extracted from rough mesh ${\mathbf{K}_{f_{\text{mesh}}}},{\mathbf{K}_{h_{\text{mesh}}}}$ , FLAME-regressed 3D face keypoints ${\mathbf{K}_{f}}$ and MANO-regressed 3D hand keypoints ${\mathbf{K}_{h}}$ against the ground-truth face and hand 2D keypoints $\mathbf{\hat{K}}_{f_{\text{2D}}},\mathbf{\hat{K}}_{h_{\text{2D}}}$ .

\mathcal{L}_{\text{reproj}}=\lambda_{h}(\|\Pi(\mathbf{K}_{h}^{\prime})-\mathbf{\hat{K}}_{h_{\text{2D}}}\|_{1}+\|\Pi({\mathbf{K}_{h_{\text{mesh}}}})-\mathbf{\hat{K}}_{h_{\text{2D}}}\|_{1}+\|\Pi({\mathbf{K}_{h}})-\mathbf{\hat{K}}_{h_{\text{2D}}}\|_{1})

(10)

+\lambda_{f}(\|\Pi(\mathbf{K}_{f}^{\prime})-\mathbf{\hat{K}}_{f_{\text{2D}}}\|_{1}+\|\Pi({\mathbf{K}_{f_{\text{mesh}}}})-\mathbf{\hat{K}}_{f_{\text{2D}}}\|_{1}+\|\Pi({\mathbf{K}_{f}})-\mathbf{\hat{K}}_{f_{\text{2D}}}\|_{1}),

(11)

where $\Pi$ is the learned camera projection function. $\lambda_{h},\lambda_{f}$ are set to $4$ and $1$ respectively.

Parameter loss We apply $L_{1}$ loss on the regressed hand and face pose, shape, and facial expression parameters against their respective ground truths.

\mathcal{L}_{\text{face-params}}=(\|\beta_{\text{f}}-\hat{\beta}_{\text{f}}\|_{1}+\|\theta_{\text{f-exp}}-\hat{\theta}_{\text{f-exp}}\|_{1}+\|\theta_{\text{f-pose}}-\hat{\theta}_{\text{f-pose}}\|_{1})/3

(12)

\mathcal{L}_{\text{hand-params}}=(\|\beta_{\text{h}}-\hat{\beta}_{\text{h}}\|_{1}+\|\theta_{\text{h}}-\hat{\theta}_{\text{h}}\|_{1})/2

(13)

\mathcal{L}_{\text{params}}=\mathcal{L}_{\text{face-params}}+\mathcal{L}_{\text{hand-params}}

(14)
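
As a rough sketch of how Eqs. (12)-(14) and the total in Eq. (6) fit together, the snippet below averages the per-group parameter errors and then combines the four mesh loss terms. The dictionary keys, the function names, and the mean reduction in `l1` are assumptions made for illustration and are not taken from the paper.

```python
def l1(pred, target):
    """Mean absolute difference between two flat parameter vectors (assumed reduction)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def param_loss(face_pred, face_gt, hand_pred, hand_gt):
    """Eqs. (12)-(14): average the per-group L1 terms, then add face and hand."""
    face = sum(l1(face_pred[k], face_gt[k])
               for k in ("shape", "expression", "pose")) / 3
    hand = sum(l1(hand_pred[k], hand_gt[k])
               for k in ("shape", "pose")) / 2
    return face + hand


def mesh_loss(l_reproj, l_vert, l_key, l_params):
    """Eq. (6): fixed weights 1, 4, 2, 2."""
    return l_reproj + 4 * l_vert + 2 * l_key + 2 * l_params


# Toy parameters (values are made up): only the shape terms are off.
face_gt = {"shape": [0.0], "expression": [0.0], "pose": [0.0]}
face_pred = {"shape": [0.3], "expression": [0.0], "pose": [0.0]}
hand_gt = {"shape": [0.0], "pose": [0.0]}
hand_pred = {"shape": [0.2], "pose": [0.0]}
l_params = param_loss(face_pred, face_gt, hand_pred, hand_gt)
```

Dividing by the number of parameter groups keeps the face and hand terms on a comparable scale even though the face has one more group (expression) than the hand.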

C.2 Interaction losses

The interaction loss $\mathcal{L}_{\text{interaction}}$ consists of four components:

\mathcal{L}_{\text{interaction}}=0.2\mathcal{L}_{\text{touch}}+0.6\mathcal{L}_{\text{contact}}+\mathcal{L}_{\text{collision}}+6\mathcal{L}_{\text{deform}}.

(15)

Deformation loss Because of human facial anatomy, some face vertices deform more easily than others. We therefore apply an adaptive per-vertex weight, which grows with the magnitude of the ground-truth deformation, to a squared-error term. A regularization term additionally penalizes extremely large predicted deformations.

\mathcal{L}_{\text{deform}}=\sum_{i\in\mathcal{I}}(1+\mu\|\hat{d_{i}}\|_{2})\|\hat{d_{i}}-d_{i}\|_{2}^{2}+\lambda\sum_{i\in\mathcal{L}}\|d_{i}\|,

(16)

where $\mathcal{I}$ is the set of indices of face vertices, $d_{i}$ and $\hat{d_{i}}$ are the predicted and ground-truth deformation vectors for index $i$ , and $\mathcal{L}=\{i\in\mathcal{I}:\|d_{i}\|_{2}>3\,\text{cm}\}$ is the set of vertices with large predicted deformations. $\mu$ and $\lambda$ are empirically set to $5000$ and $100$ , respectively.
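
A minimal sketch of Eq. (16) in plain Python is given below, under two assumptions that the text does not confirm: coordinates are in meters (so the 3 cm threshold becomes `0.03`), and the unsubscripted norm in the regularization term is Euclidean. The function names are hypothetical.

```python
import math


def norm(v):
    """Euclidean norm of a 3D vector."""
    return math.sqrt(sum(x * x for x in v))


def deformation_loss(pred, gt, mu=5000.0, lam=100.0, large_thresh=0.03):
    """Eq. (16) for lists of per-vertex 3D deformation vectors.

    Assumes meters, so the 3 cm threshold is 0.03 (an assumption).
    """
    data_term = 0.0
    reg_term = 0.0
    for d, d_hat in zip(pred, gt):
        # Adaptive weight: vertices with larger ground-truth deformation count more.
        weight = 1.0 + mu * norm(d_hat)
        sq_err = sum((a - b) ** 2 for a, b in zip(d_hat, d))
        data_term += weight * sq_err
        # Regularizer: only predictions above the threshold are penalized.
        if norm(d) > large_thresh:
            reg_term += norm(d)
    return data_term + lam * reg_term


# One vertex with a 1 cm ground-truth deformation that the model misses entirely.
missed = deformation_loss([(0.0, 0.0, 0.0)], [(0.0, 0.0, 0.01)])
# One vertex predicted exactly right, but with a 4 cm deformation.
large = deformation_loss([(0.0, 0.0, 0.04)], [(0.0, 0.0, 0.04)])
```

In the second case the data term is zero and the loss comes entirely from the regularizer, which shows that a large prediction is penalized even when it matches the ground truth.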

Touch loss Let $\mathbf{V}_{F_{C}}$ and $\mathbf{V}_{H_{C}}$ denote the set of face and hand vertices that are predicted by the model to have contact probability greater than $0.5$ .

\mathcal{L}_{\text{touch}}=\text{CD}(\mathbf{V}_{F_{C}},\mathbf{V}_{H_{C}})+\text{CD}(\mathbf{V}_{H_{C}},\mathbf{V}_{F_{C}}),

(17)

where $\text{CD}(X,Y)$ is the one-directional Chamfer Distance (CD), i.e., the mean distance from each point in $X$ to its closest point in $Y$ .
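
The touch loss of Eq. (17) can be illustrated with the brute-force sketch below. It assumes unsquared Euclidean distances and returns zero when either contact set is empty; neither choice is specified in the text, and the function names are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def one_sided_cd(xs, ys):
    """Mean distance from each point in xs to its closest point in ys."""
    if not xs or not ys:
        return 0.0  # assumption: no predicted contact means no penalty
    return sum(min(dist(x, y) for y in ys) for x in xs) / len(xs)


def touch_loss(face_verts, face_probs, hand_verts, hand_probs, thresh=0.5):
    """Eq. (17): symmetric sum of one-directional CDs between predicted contact sets."""
    face_c = [v for v, p in zip(face_verts, face_probs) if p > thresh]
    hand_c = [v for v, p in zip(hand_verts, hand_probs) if p > thresh]
    return one_sided_cd(face_c, hand_c) + one_sided_cd(hand_c, face_c)


# Toy example (values are made up): one face vertex and two hand vertices in contact.
loss = touch_loss([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [0.9, 0.1],
                  [(0.0, 0.0, 1.0), (0.0, 0.0, 3.0)], [0.8, 0.7])
```

The two directions differ here: the face-to-hand term is 1, while the hand-to-face term averages 1 and 3, which is why both directions are summed.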

Collision loss Let $\mathbf{V}_{H_{\text{Col}}}$ denote the set of hand vertices that penetrate the face surface, and let $\mathbf{V}_{F}$ and $\mathbf{D}_{F}$ denote the predicted face mesh vertices and deformations.

\mathcal{L}_{\text{collision}}=\text{CD}(\mathbf{V}_{H_{\text{Col}}},\mathbf{V}_{F}-\mathbf{D}_{F}).

(18)

Contact loss Let $\mathbf{C}_{H}$ and $\mathbf{C}_{F}$ denote the predicted hand and face contact probabilities and ${\mathbf{\hat{C}}_{H}}$ , ${\mathbf{\hat{C}}_{F}}$ denote the ground-truth contact labels.

\mathcal{L}_{\text{contact}}=\text{BCE}(\mathbf{C}_{H},{\mathbf{\hat{C}}_{H}})+\text{BCE}({\mathbf{C}_{F}},{\mathbf{\hat{C}}_{F}}),

(19)

where BCE denotes the binary cross-entropy loss.
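
For illustration, the contact loss of Eq. (19) and the weighted total of Eq. (15) might be sketched as follows. The mean reduction over vertices and the clamping constant `eps` are assumptions, and the function names are hypothetical.

```python
import math


def bce(probs, labels, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clamped to avoid log(0)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(probs)


def contact_loss(c_h, c_h_gt, c_f, c_f_gt):
    """Eq. (19): BCE on hand contacts plus BCE on face contacts."""
    return bce(c_h, c_h_gt) + bce(c_f, c_f_gt)


def interaction_loss(l_touch, l_contact, l_collision, l_deform):
    """Eq. (15): fixed weights 0.2, 0.6, 1, 6."""
    return 0.2 * l_touch + 0.6 * l_contact + l_collision + 6 * l_deform


# Toy example: a maximally uncertain prediction for one hand and one face vertex.
l_contact = contact_loss([0.5], [1.0], [0.5], [0.0])
```

A prediction of 0.5 costs $\ln 2$ per vertex regardless of the label, so `l_contact` equals $2\ln 2$ here.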

Appendix D More Discussions

D.1 Performance under Challenging Occlusion


Figure 8: Examples of failed keypoint estimation in case of large self-occlusion. (a) input image; (b) inaccurate keypoint estimation by the same keypoint estimators used in Decaf [68, 3]; (c) reconstructed hand-face interaction by our method; (d) reconstructed hand-face interaction by Decaf.

As seen in Fig. 8, our end-to-end DICE method is robust under challenging self-occlusion cases, such as the hand covering more than half of the face. On the other hand, Decaf [97], which requires an initial keypoint prediction for test-time optimization, performs poorly in this situation.

D.2 Failure Cases

In Fig. 9, we show failure cases of our method. As shown in Fig. 9 (a), when there is a complex interaction between the hand and face, such as one involving a cleaning sponge, the accuracy of hand mesh recovery drops. Also, as shown in Fig. 9 (b), when the face completely occludes the hand, a highly challenging scenario unseen in the training data, our model cannot faithfully reconstruct the hand position.


Figure 9: Samples of failure cases.