DICE: End-to-end Deformation Capture of Hand-Face Interactions from a Single Image

Qingxuan Wu1, Zhiyang Dou1,2,∗, Sirui Xu3, Soshi Shimada4, Chen Wang1, Zhengming Yu6,
Yuan Liu2, Cheng Lin2, Zeyu Cao5, Taku Komura2, Vladislav Golyanik4,
Christian Theobalt4, Wen** Wang6, Lingjie Liu1,
1University of Pennsylvania, 2The University of Hong Kong,
3University of Illinois Urbana-Champaign, 4Max Planck Institute for Informatics,
5University of Cambridge, 6Texas A&M University
Corresponding authors.

Reconstructing 3D hand-face interactions with deformations from a single image is a challenging yet crucial task with broad applications in AR, VR, and gaming. The challenges stem from self-occlusions during single-view hand-face interactions, diverse spatial relationships between hands and face, complex deformations, and the ambiguity of the single-view setting. The first and only method for hand-face interaction recovery, Decaf [97], introduces a global fitting optimization guided by contact and deformation estimation networks trained on studio-collected data with 3D annotations. However, Decaf suffers from a time-consuming optimization process and limited generalization capability due to its reliance on 3D annotations of hand-face interaction data. To address these issues, we present DICE, the first end-to-end method for Deformation-aware hand-face Interaction reCovEry from a single image. DICE estimates the poses of hands and faces, contacts, and deformations simultaneously using a Transformer-based architecture. It features disentangling the regression of local deformation fields and global mesh vertex locations into two network branches, enhancing deformation and contact estimation for precise and robust hand-face mesh recovery. To improve generalizability, we propose a weakly-supervised training approach that augments the training set using in-the-wild images without 3D ground-truth annotations, employing the depths of 2D keypoints estimated by off-the-shelf models and adversarial priors of poses for supervision. Our experiments demonstrate that DICE achieves state-of-the-art performance on a standard benchmark and in-the-wild data in terms of accuracy and physical plausibility. Additionally, our method operates at an interactive rate (20 fps) on an Nvidia 4090 GPU, whereas Decaf requires more than 15 seconds for a single image. Our code will be publicly available upon publication.

1 Introduction

Hand-face interaction is a common behavior observed up to 800 times per day across all ages and genders [98]. Therefore, faithfully recovering hand-face interactions with plausible deformations is an important task given its wide applications in AR/VR [87, 36, 113], character animation [88, 136], and human behavior analysis [66, 29, 74]. Given the speed requirement of downstream applications like AR/VR, fast and accurate 3D reconstruction of hand-face interactions is highly desirable. However, several challenges make monocular hand-face deformation and interaction recovery a difficult task: 1) the self-occlusion involved in hand-face interaction, 2) the diversity of hand and face poses, contacts, and deformations, and 3) ambiguity in the single-view setting. Most existing methods [91, 75] either reconstruct only the hand [92] and face [57] meshes or treat them as parts of a unified whole-body model [67, 82], without capturing contacts and deformations. A seminal advance, Decaf [97], recovers hand-face interactions with deformations and contacts taken into account. However, it requires time-consuming optimization, which takes more than 15 seconds per image, rendering it unsuitable for interactive applications. The iterative fitting process of Decaf relies on an accurate estimation of hand and face keypoints and contacts on the hand and face surfaces, which could fail when significant occlusion is present in the image; see Fig. 8 in Appendix D.1. Additionally, Decaf cannot scale its training to the abundant hand-face interaction data available in the wild, as it requires 3D ground-truth annotations, i.e., contact labels and deformations.

Figure 1: Our method, DICE, is the first end-to-end approach that captures hand-face interaction and deformation from a monocular image. (a) Decaf validation dataset. (b) In-the-wild images. (c) Use-cases in VR.

To tackle the issues above, we present DICE, the first end-to-end approach for Deformation-aware hand-face Interaction reCovEry from a monocular image. We use a Transformer-based model with the attention mechanism to effectively capture hand-face relationships. Motivated by the global nature of the pose and shape of hand and face and the local nature of the deformation field and contact probabilities, we further propose disentangling the regression of deformation from the pose and shape of hand and face represented by mesh vertex positions into two network branches, which enhances the estimation of deformations and contacts while resulting in accurate and robust hand and face mesh recovery. Instead of directly regressing hand and face parameters, we learn an intermediate non-parametric mesh representation. We then use this representation to regress the pose and shape parameters of hand and face with a neural inverse-kinematics network. Directly regressing the pose and shape parameters requires learning abstract parameters, which is a highly non-linear process and suffers from image-model misalignment; in contrast, predicting vertex positions in Euclidean space and then applying inverse kinematics enhances the reconstruction accuracy [53, 52, 51]. Consequently, our model achieves higher reconstruction accuracy than all previous regression and optimization-based methods. It also reaps the benefits of an animatable parametric hand and face representation that could be readily used by downstream applications.

Meanwhile, despite containing rich annotations, the existing benchmark dataset [97] collected in a studio is still limited in the diversity of hand motions, facial expressions, and appearances. Training a model only on such a dataset limits its generalization capability when applied to in-the-wild data. To achieve robust and generalizable hand-face interaction and deformation recovery, we introduce a weak-supervision training pipeline that utilizes in-the-wild images without the reliance on 3D annotations.

In addition to the 2D keypoint supervision for in-the-wild images, we propose a novel depth supervision pipeline. This pipeline leverages the robust depth prior from a diffusion-based monocular depth estimation model [44], which provides essential geometric information for accurate mesh recovery and captures spatial relationships critical for contact state and deformation estimation. To improve our model’s robustness, we further employ pose priors of the hand and face by introducing hand and face parameter discriminators that learn rich hand and face motion priors from multiple datasets on hand or face separately [79, 140]. By incorporating a small set of real-world images alongside the Decaf dataset and leveraging our weak-supervision pipeline, we markedly enhance the accuracy and generalization capacity of our model.

As a result, our method achieves superior performance in terms of accuracy, physical plausibility, inference speed, and generalizability. It surpasses all previous methods in accuracy on both standard benchmarks and challenging in-the-wild images. Fig. 1 visualizes some results of our method. We conduct extensive experiments to validate our method. In summary, our contribution is three-fold:

  • We propose DICE, the first end-to-end learning-based approach that accurately recovers hand-face interactions and deformations from a single image.

  • We propose a novel weakly-supervised training scheme with depth supervision on keypoints to augment the Decaf data distribution with a diverse real-world data distribution, significantly improving the generalization ability.

  • DICE achieves superior reconstruction quality compared to baseline methods while running at an interactive rate (20 fps).

2 Related Work

Extensive efforts have been made to recover meshes from monocular images, including human bodies [2, 71, 53, 5, 14, 116, 112, 111, 64, 43, 4, 128, 25, 59, 110, 18, 11, 40, 63], hands [93, 73, 72, 70, 76, 81, 121, 120, 54, 126], and faces [24, 26, 115, 16, 139, 7, 134, 78, 35, 8, 48, 50]. This also includes recovering the surrounding environments [12, 39, 32, 33, 135, 58, 133, 96, 69, 114] and interacting objects [119, 129, 85, 104, 30, 100, 129, 27, 86, 34, 123, 9, 10, 65, 15] while reconstructing the mesh. The acquired versatile behaviors play a crucial role in various applications, including motion generation [101, 83, 80, 28, 108, 117, 118, 61, 138, 105, 84, 17, 106], augmented reality (AR), virtual reality (VR), and human behavior analysis [130, 122, 132, 131, 29, 66]. In the following, we mainly review the related works on hand, face and full-body mesh recovery.

3D Interacting Hands Recovery. Recent advancements have markedly enhanced the capture and recovery of 3D hand interactions. Early studies have achieved reconstruction of 3D hand-hand interactions utilizing a fitting framework, employing resources such as RGBD sequences [77], hand segmentation maps [74], and dense matching maps [107]. The introduction of large-scale datasets for interacting hands [73, 72] has motivated the development of regression-based approaches. Notably, these include regressing 3D interacting hand directly from monocular RGB images [93, 70, 127, 55, 141]. Additionally, research has extended to recovering interactions between hands and various objects in the environment, including rigid [6, 27, 65, 100, 20, 125, 124, 13], articulated [21], and deformable [103] objects. Following [97], our work distinguishes itself by introducing hand interactions with a deformable face, characterized by its non-uniform stiffness—a significant difference from conventional deformable models. This innovation presents unique challenges in accurately modeling interactions.

3D Human Face Recovery. Research in human face recovery encompasses both optimization-based [1, 102] and regression-based [26, 94] methodologies. Beyond mere geometry reconstruction, recent approaches have evolved to incorporate training networks with the integration of differentiable renderers [24, 139, 137, 109, 11]. These methods estimate variables such as lighting, albedo, and normals to generate facial images and compare them with the monocular input. However, a significant limitation in much of the existing literature is the neglect of the face’s deformable nature and hand-face interactions. Decaf [97] represents a pivotal development in this area, attempting to model the complex mimicry of musculature and the underlying skull anatomy through optimization techniques. In contrast, our work introduces a regression-based, end-to-end method for efficient problem-solving, setting a new benchmark in the field.

3D Full-Body Recovery. The task of monocular human pose and shape estimation involves reconstructing a 3D human body from a single color image. Optimization-based approaches [2, 82, 95, 90] employ the SMPL model [67], fitting it to 2D keypoints detected within the image. Conversely, regression-based methods [53, 49, 45, 43, 23, 22, 62, 5, 25] leverage deep neural networks to directly infer the pose and shape parameters of the SMPL model. Hybrid methods [46] integrate both optimization and regression techniques, enhancing 3D model supervision. We follow parametric methods [53, 5, 43, 2] due to their flexibility for animation purposes. Unlike most research in this domain, which primarily concentrates on the main body with only rough estimations of hands and face, our methodology accounts for detailed interactions between these components.

3 Method

Figure 2: Overview of the proposed DICE framework. The input image is first fed to a CNN to extract a feature map, which is then passed to the Transformer-based encoders for mesh and interaction, i.e., MeshNet and InteractionNet. The MeshNet extracts hand and face mesh features, which are then used by the Inverse Kinematics models (IKNets) to predict pose and shape parameters that drive the FLAME [57] and MANO [92] models. The InteractionNet predicts per-vertex hand-face contact probabilities and face deformation fields from the feature map; the latter are applied to the face mesh output by the FLAME model. To improve the generalization capability, we introduce a Weakly-Supervised Training Scheme using off-the-shelf 2D keypoint detection models [68, 3] and depth estimation models [44] to provide depth supervision on keypoints. In addition, we use head and hand discriminators to constrain the distribution of parameters regressed by IKNets.

Problem Formulation. Following Decaf [97], we adopt the MANO [92] and FLAME [57] parametric models for the hand and face, respectively. Given a single RGB image $\mathbf{I}\in\mathbb{R}^{224\times 224\times 3}$, the objective of this task is to reconstruct the vertices of the hand mesh $\mathbf{V}_{H}\in\mathbb{R}^{778\times 3}$ and face mesh $\mathbf{V}_{F}\in\mathbb{R}^{5023\times 3}$, along with capturing the face deformation vectors $\mathbf{D}\in\mathbb{R}^{5023\times 3}$ caused by hand-face interaction and the non-rigid nature of the face, and the per-vertex contact probabilities of the hand $\mathbf{C}_{H}\in\mathbb{R}^{778}$ and face $\mathbf{C}_{F}\in\mathbb{R}^{5023}$.

3.1 Transformer-based Hand-Face Interaction Recovery

Our model incorporates a two-branch Transformer architecture and integrates inverse-kinematic models, specifically MeshNet, InteractionNet, and IKNets. A differentiable renderer [89] is used to compute depth maps from the predicted mesh for depth supervision, and the hand and face discriminators are used as priors for constraining the hand and face poses; See Fig. 2 for an overview.

Given a monocular RGB image $\mathbf{I}$, we use a pretrained HRNet-W64 [99] backbone to extract a feature map $\bm{X}_{\text{I}}\in\mathbb{R}^{H\times W\times C}$, where $H, W$ are the spatial dimensions and $C$ is the channel dimension. Following [63, 64], we flatten the image feature maps and upsample the $H\times W$ feature maps to $N$ feature maps, one for each keypoint and coarse vertex of both hand and face. We then concatenate the resulting features $\mathbf{F}^{\prime}\in\mathbb{R}^{N\times C}$ with the $N\times 3$ downsampled hand and face vertices and keypoints of the mean pose, which serve as mesh vertex and joint queries for the transformer, resulting in a feature map $\mathbf{F}\in\mathbb{R}^{N\times(C+3)}$. The mesh vertices and joints also serve as the positional encoding. To effectively model the interaction between the vertices, we mask the image feature maps corresponding to a random subset of vertices.
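The query construction described above can be illustrated with a small shape-level sketch. It only shows the concatenation of $N\times C$ features with $N\times 3$ template coordinates into $N\times(C+3)$ queries; the function name `build_queries` and the toy sizes are hypothetical, and the sketch does not cover the backbone, the upsampling, or the random masking.

```python
from typing import List, Sequence


def build_queries(features: Sequence[Sequence[float]],
                  template_xyz: Sequence[Sequence[float]]) -> List[List[float]]:
    """Concatenate N x C image features with N x 3 template coordinates.

    Returns an N x (C + 3) list of query vectors. This is an illustrative
    helper, not code from the paper.
    """
    if len(features) != len(template_xyz):
        raise ValueError("features and template must share the same N")
    queries = []
    for feat, xyz in zip(features, template_xyz):
        if len(xyz) != 3:
            raise ValueError("each template point needs 3 coordinates")
        # The 3D template position is appended to the per-query feature.
        queries.append(list(feat) + list(xyz))
    return queries


# Toy sizes only (hypothetical): N = 2 queries, C = 4 channels.
toy_features = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
toy_template = [[0.0, 0.0, 0.0], [0.0, 0.1, 0.0]]
toy_queries = build_queries(toy_features, toy_template)  # 2 x (4 + 3)
```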

We propose to use two separate branches, MeshNet and InteractionNet, splitting the regression of mesh vertices and a deformation field for their semantic difference: the mesh positions are more global while the deformation vectors and contact states are relatively local. Specifically, the backbone is followed by two progressively downsampling transformers: MeshNet, which takes the feature map $\mathbf{F}$ as input and regresses the rough vertex positions of the hand, $\mathbf{V}_{H}^{\prime}$, and face, $\mathbf{V}_{F}^{\prime}$; and InteractionNet, which first downsamples the feature map $\mathbf{F}$, then uses it to predict the 3D deformation field $\mathbf{D}$ at each face vertex along with the contact labels for each hand and face vertex, $\mathbf{C}_{H}$ and $\mathbf{C}_{F}$. Note that the contacts and deformations are regressed in the same encoder to model their close relationship: the contacts cause the deformations. We validate our design in Sec. 4.4.

Next, instead of directly regressing the hand and face vertices, we regress the pose and shape of the parametric hand and face models, making the output readily animatable for downstream applications. This is achieved by a neural inverse kinematics model, named IKNet, similar to Kolotouros et al. [47]. Our IKNet takes the roughly estimated hand and face mesh vertices $\mathbf{V}_{H}^{\prime}$ and $\mathbf{V}_{F}^{\prime}$ as inputs and predicts the hand and face pose, shape, and expression parameters $(\theta_{\text{h}},\beta_{\text{h}})$ and $(\theta_{\text{f-pose}},\beta_{\text{f}},\theta_{\text{f-exp}})$, along with the position and orientation for the hand [92] and face [57], respectively. We use the predicted parameters to first obtain the hand mesh $\mathbf{V}_{H}$ and the undeformed face mesh $\mathbf{V}_{F}^{*}$. Then, we apply the deformation $\mathbf{D}$ predicted by the InteractionNet to $\mathbf{V}_{F}^{*}$ to get the final deformed face $\mathbf{V}_{F}$.
Regressing parameters offers several advantages: first, it enables readily animatable meshes; second, compared to non-parametric regression methods, where meshes typically contain artifacts such as spikes [63, 11, 64], the mesh quality is significantly improved; third, the compact parameter space facilitates a more effective discriminator, which will be discussed in the following section.
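As a minimal sketch of the last step, the deformation can be read as a per-vertex displacement added to the undeformed face mesh. The additive form $\mathbf{V}_{F}=\mathbf{V}_{F}^{*}+\mathbf{D}$ is an assumption made here for illustration (the text only states that $\mathbf{D}$ is applied to $\mathbf{V}_{F}^{*}$), and `apply_deformation` is a hypothetical helper name.

```python
from typing import List, Sequence


def apply_deformation(undeformed: Sequence[Sequence[float]],
                      deformation: Sequence[Sequence[float]]) -> List[List[float]]:
    """Add a per-vertex displacement to each undeformed face vertex.

    Assumes the deformation field is a simple additive offset per vertex.
    """
    if len(undeformed) != len(deformation):
        raise ValueError("mesh and deformation must have the same vertex count")
    return [
        [coord + offset for coord, offset in zip(vertex, displacement)]
        for vertex, displacement in zip(undeformed, deformation)
    ]


# Toy example with 2 vertices instead of the 5023 FLAME vertices.
toy_face = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
toy_deform = [[0.0, 0.0, 0.5], [0.0, 0.0, 0.0]]  # only vertex 0 is pushed
toy_deformed = apply_deformation(toy_face, toy_deform)
```

Vertices with a zero displacement are left unchanged, which matches the intuition that only the region around a contact deforms.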

3.2 Weakly-Supervised Training Scheme

Although the aforementioned benchmark, Decaf [97], accurately captures hand, face, self-contact, and deformations, it consists of eight subjects and is recorded in a controlled environment with green screens. Training a model only with Decaf limits its generalization capability to in-the-wild images that have far more complex and diverse human identities, hand poses, and face poses.

To further enhance the generalization capability of our model, we train our model with 500 diverse in-the-wild images of hand-face interaction collected from the internet without the reliance on 3D ground-truth annotations. First, we use 2D hand and face keypoints detected by [68] and [3] as 2D supervision.

Then, we propose to use Marigold [44], a diffusion-based monocular depth estimator pre-trained on a large number of images to generate 2D affine-invariant depth maps for supervision in the direction of depth (see Eq. 4). The depth supervision provides a strong depth prior, which guides the spatial relationship between hand and face meshes, promoting accurate modeling of hand-face interaction. For supervision, we use a differentiable rasterizer [89] to compute a depth map from the predicted hand and face meshes and supervise the network using a depth loss calculated between the depth values of hand and face keypoints and corresponding points on the predicted depth map. The introduced weak-supervision pipeline significantly enhances our model’s generalization capability and robustness, which we investigate in Sec. 4.4. In our experiment, we found that when training the model with only a small dataset of 500 images, we could significantly improve the model’s accuracy and generalization capability. Moreover, we train adversarial priors on the hand and face parameter space on multiple hand and face pose datasets: the face-only RenderMe-360 [79], the hand-only FreiHand [140], and Decaf [97]. This ensures the plausibility of generated face and hand poses and shapes while allowing for flexible poses and shapes beyond the Decaf data distribution to handle in-the-wild cases.
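To make the keypoint-level depth supervision concrete, the sketch below reads depth values from a depth map at 2D keypoint locations. It is only an illustration: the nearest-pixel lookup, the `(x, y)` pixel convention, and the name `sample_depth_at_keypoints` are assumptions, and a plain lookup like this is not differentiable, unlike the rasterizer-based pipeline described above.

```python
from typing import List, Sequence, Tuple


def sample_depth_at_keypoints(depth_map: Sequence[Sequence[float]],
                              keypoints_xy: Sequence[Tuple[float, float]]) -> List[float]:
    """Nearest-pixel depth lookup at 2D keypoints given as (x, y) pixels.

    Coordinates outside the image are clamped to the border (an assumption).
    """
    height = len(depth_map)
    width = len(depth_map[0])
    samples = []
    for x, y in keypoints_xy:
        col = min(max(int(round(x)), 0), width - 1)
        row = min(max(int(round(y)), 0), height - 1)
        samples.append(depth_map[row][col])
    return samples


# Toy 2 x 2 depth map; the second keypoint lies outside and is clamped.
toy_depth = [[1.0, 2.0], [3.0, 4.0]]
toy_samples = sample_depth_at_keypoints(toy_depth, [(0.2, 0.9), (5.0, -1.0)])
```

The same lookup would be applied to both the estimated depth map and the rendered depth map, giving two depth values per keypoint that a depth loss can compare.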

3.3 Loss Functions

Mesh losses $\mathcal{L}_{\text{mesh}}$: For richly annotated data in Decaf [97], we employ an $L_{1}$ loss for 3D keypoints, 3D vertices, and 2D reprojected keypoints against their respective ground truths, following common practice in human- and hand-mesh recovery [63, 11, 18]. We further apply an $L_{1}$ loss $\mathcal{L}_{\text{params}}$ on the estimated hand and face pose, shape, and facial expression against the ground-truth parameters. For in-the-wild data, only the 2D reprojected keypoints are supervised, as this is the only type with corresponding ground truth.

Interaction losses $\mathcal{L}_{\text{interaction}}$: For data in Decaf [97], we impose Chamfer Distance losses to enforce touch for predicted contact vertices and discourage collision. We also use a binary cross-entropy loss to supervise contact labels and a deformation loss with adaptive weighting to supervise deformation vectors, similar to [97]. For in-the-wild data, we also impose touch and collision losses since they do not require annotations.

Adversarial losses $\mathcal{L}_{\text{adv}}$ are applied to the predicted hand and face parameters for in-the-wild data to constrain their parameter space, and for Decaf data to facilitate the training of the discriminators. The adversarial loss is given by:

$$\mathcal{L}_{\text{adv}}(E)=\mathbb{E}_{\theta_{F}\sim p_{E}}\left[\log\left(1-D_{F}(E(I))\right)\right]+\mathbb{E}_{\theta_{H}\sim p_{E}}\left[\log\left(1-D_{H}(E(I))\right)\right]. \tag{1}$$

The losses for the hand and face discriminators are given by:

$$\mathcal{L}_{\text{adv}}(D_{F})=-\left(\mathbb{E}_{\theta_{F}\sim p_{E}}\left[\log\left(1-D_{F}(E(I))\right)\right]+\mathbb{E}_{\theta_{F}\sim p_{\text{data}}}\left[\log\left(D_{F}(\theta)\right)\right]\right), \tag{2}$$

$$\mathcal{L}_{\text{adv}}(D_{H})=-\left(\mathbb{E}_{\theta_{H}\sim p_{E}}\left[\log\left(1-D_{H}(E(I))\right)\right]+\mathbb{E}_{\theta_{H}\sim p_{\text{data}}}\left[\log\left(D_{H}(\theta)\right)\right]\right), \tag{3}$$

where $E$ jointly denotes the image backbone, the mesh encoder and the parameter regressor, $\theta_{F}=\text{concat}(\theta_{\text{face-shape}},\theta_{\text{face-jaw}},\theta_{\text{face-exp}})$, and $\theta_{H}=\text{concat}(\theta_{\text{hand-shape}},\theta_{\text{hand-pose}})$.
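The adversarial objectives in Eqs. (1)-(3) can be evaluated numerically once the discriminator outputs are available. The sketch below assumes each discriminator outputs a probability in $(0, 1)$ and replaces expectations with batch means; the function names and the small clamp `_EPS` are hypothetical additions for illustration, not details taken from the paper.

```python
import math
from typing import Sequence

_EPS = 1e-12  # numerical guard against log(0); an assumption, not from the paper


def _mean_log(values: Sequence[float]) -> float:
    """Batch mean of log(v), with v clamped to [_EPS, 1]."""
    return sum(math.log(min(max(v, _EPS), 1.0)) for v in values) / len(values)


def regressor_adv_loss(d_face_on_pred: Sequence[float],
                       d_hand_on_pred: Sequence[float]) -> float:
    """Eq. (1): E[log(1 - D_F(E(I)))] + E[log(1 - D_H(E(I)))]."""
    face_term = _mean_log([1.0 - d for d in d_face_on_pred])
    hand_term = _mean_log([1.0 - d for d in d_hand_on_pred])
    return face_term + hand_term


def discriminator_loss(d_on_pred: Sequence[float],
                       d_on_data: Sequence[float]) -> float:
    """Eqs. (2)-(3): -(E[log(1 - D(E(I)))] + E[log D(theta)])."""
    pred_term = _mean_log([1.0 - d for d in d_on_pred])
    data_term = _mean_log(d_on_data)
    return -(pred_term + data_term)


# A discriminator that separates predictions (low score) from data (high
# score) gets a lower loss than one that confuses them.
good_d = discriminator_loss(d_on_pred=[0.1], d_on_data=[0.9])
bad_d = discriminator_loss(d_on_pred=[0.9], d_on_data=[0.1])
```

The same `discriminator_loss` form serves both the face and the hand discriminator, since Eqs. (2) and (3) differ only in which parameters are scored.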

Depth loss $\mathcal{L}_{\text{depth}}$: To provide pseudo-3D hand and face keypoint supervision for in-the-wild data, we use a modified SILog loss [19], an affine-invariant depth loss, as our depth supervision $\mathcal{L}_{\text{depth}}$. Formally, let $\hat{K}_{D}$ denote the pseudo-ground-truth affine-invariant depth of the face and hand keypoints, and $K_{D}$ denote the rendered depth for the keypoints,

$$\mathcal{L}_{\text{depth}}=\left[\mathbf{Var}\left(\log(K_{D}+\varepsilon)-\log(\hat{K}_{D}+\varepsilon)\right)\right]^{1/2}, \tag{4}$$

where $\mathbf{Var}$ is the standard variance operator and $\varepsilon=10^{-7}$.
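Eq. (4) is short enough to check by hand. The sketch below computes it for a list of keypoint depths; using the population variance (dividing by the number of keypoints) is an assumption, since the text only refers to the standard variance operator, and `depth_loss` is a hypothetical name.

```python
import math
from typing import Sequence


def depth_loss(rendered: Sequence[float],
               pseudo_gt: Sequence[float],
               eps: float = 1e-7) -> float:
    """Square root of the variance of per-keypoint log-depth differences."""
    if len(rendered) != len(pseudo_gt) or not rendered:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [math.log(k + eps) - math.log(k_hat + eps)
             for k, k_hat in zip(rendered, pseudo_gt)]
    mean = sum(diffs) / len(diffs)
    # Population variance (divide by n); an assumption of this sketch.
    variance = sum((d - mean) ** 2 for d in diffs) / len(diffs)
    return math.sqrt(variance)


# Depths that differ only by a global scale give (almost) zero loss,
# because the scale becomes a constant offset in log space.
scaled = depth_loss([1.0, 2.0, 4.0], [2.0, 4.0, 8.0])
# Depths with a different relative structure are penalized.
mismatched = depth_loss([1.0, 2.0, 4.0], [1.0, 1.0, 1.0])
```

Because the variance removes any constant offset between the two log-depth vectors, the loss does not penalize an unknown global scale of the estimated depth; only the relative depth ordering and ratios between keypoints matter.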

Overall, our loss for the mesh and interaction networks is formulated by

$$\mathcal{L}=\lambda_{\text{mesh}}\mathcal{L}_{\text{mesh}}+\lambda_{\text{interaction}}\mathcal{L}_{\text{interaction}}+\lambda_{\text{adv}}\mathcal{L}_{\text{adv}}+\lambda_{\text{depth}}\mathcal{L}_{\text{depth}}, \tag{5}$$

where $\lambda_{\text{mesh}}=12.5$, $\lambda_{\text{interaction}}=5$, $\lambda_{\text{depth}}=2.5$, and $\lambda_{\text{adv}}=1$ for all the experiments in the paper; see Appendix C for more details.
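As a worked arithmetic example of Eq. (5), the weighted sum can be written as below. The weights are the ones stated above; the dictionary layout, the function name `total_loss`, and the toy loss values are hypothetical.

```python
from typing import Dict

# Weights as reported for Eq. (5).
LOSS_WEIGHTS: Dict[str, float] = {
    "mesh": 12.5,
    "interaction": 5.0,
    "adv": 1.0,
    "depth": 2.5,
}


def total_loss(terms: Dict[str, float]) -> float:
    """Weighted sum of the four loss terms of Eq. (5)."""
    missing = set(LOSS_WEIGHTS) - set(terms)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(LOSS_WEIGHTS[name] * terms[name] for name in LOSS_WEIGHTS)


# Toy values (not results from the paper):
# 12.5 * 0.2 + 5 * 0.1 + 1 * 0.0 + 2.5 * 0.4 = 2.5 + 0.5 + 0.0 + 1.0 = 4.0
toy_total = total_loss({"mesh": 0.2, "interaction": 0.1, "adv": 0.0, "depth": 0.4})
```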

4 Experimental Results

4.1 Datasets and Metrics


We employ Decaf [97] for reconstructing 3D face and hand interactions with deformations, along with the in-the-wild dataset we collected with 500 images. We use the hand and face shape, pose, and expression data from Decaf [97], RenderMe-360 [79], and FreiHand [140] for training the adversarial priors. We use the training set of the aforementioned datasets for network training. We utilize the Decaf test set for quantitative evaluation, and additionally, we visualize in-the-wild images from the test set for qualitative evaluation.


We adopt commonly-used metrics for mesh recovery accuracy following [43, 63, 18, 11]:
  • Mean Per-Joint Position Error (MPJPE): the average Euclidean distance between predicted keypoints and ground-truth keypoints.
  • PAMPJPE: MPJPE after Procrustes Analysis (PA) alignment.
  • Per-Vertex Error (PVE): the per-vertex error with translation.
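The point-wise accuracy metrics above share one computation: the mean Euclidean distance between corresponding 3D points. A minimal sketch follows; `mean_point_error` is a hypothetical name, units follow whatever the inputs use, and the Procrustes alignment needed for PAMPJPE is deliberately left out.

```python
import math
from typing import Sequence


def mean_point_error(pred: Sequence[Sequence[float]],
                     gt: Sequence[Sequence[float]]) -> float:
    """Average Euclidean distance between corresponding 3D points.

    With joints as input this is MPJPE; with mesh vertices it is a
    per-vertex error. No alignment is applied here.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(pred)


# Two toy joints with errors of 3 and 4 units: (3 + 4) / 2 = 3.5.
toy_error = mean_point_error(
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    [[0.0, 0.0, 3.0], [1.0, 4.0, 0.0]],
)
```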

Following Decaf [97], we use the plausibility metrics below:
• Collision Distance (Col. Dist.): the average collision distance over vertices and frames;
• Non-Collision Ratio (Non. Col.): the proportion of frames without hand-face collisions;
• Touchness Ratio: the ratio of hand-face contacts among ground-truth contacting frames;
• F-Score: the harmonic mean of Non-Collision Ratio and Touchness Ratio.
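The F-Score defined above is a plain harmonic mean. The sketch below (function name and input values are illustrative) shows why it rewards methods that balance the two ratios: a reconstruction that never collides because the hand never touches the face scores zero.

```python
def f_score(non_collision_ratio, touchness_ratio):
    """Harmonic mean of the two ratios (both in the same unit, e.g. percent)."""
    total = non_collision_ratio + touchness_ratio
    if total == 0:
        return 0.0
    return 2.0 * non_collision_ratio * touchness_ratio / total


if __name__ == "__main__":
    # Made-up ratios, in percent.
    print(f_score(50.0, 50.0))   # balanced ratios
    print(f_score(100.0, 0.0))   # no collisions, but also no contact
```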

4.2 Implementation Details

We train MeshNet, InteractionNet, and IKNet, along with the face and hand discriminators, using three identical AdamW optimizers with a learning rate of $6\times 10^{-4}$ and a learning rate decay of $1\times 10^{-4}$, optimizing in an alternating manner. The batch size is set to $16$ during training. Each training run takes $40$ epochs and $48$ hours. The model is trained and evaluated on 8 Nvidia A6000 GPUs with an AMD 128-core CPU. For a fair comparison, all baseline models use the settings from their original papers. Inference times are measured on a single Nvidia A6000 GPU.

4.3 Performance on Hand-Face Interaction and Deformation Recovery

In addition to the baselines considered in Decaf [97], we compare our method with METRO [63], a representative end-to-end transformer-based model for human body/hand mesh recovery. For a fair comparison, we use a modified version of METRO that predicts hand and face meshes, with extra output heads added to predict contact and deformation.

Figure 3: Qualitative results of hand-face interaction, deformation, and contact recovery by DICE on Decaf and in-the-wild images. In contact visualizations, a deeper color indicates a higher contact probability.
Figure 4: Qualitative comparison of DICE, Decaf [97], PIXIE [23], and METRO* [64] on the Decaf validation set and in-the-wild images. Our method achieves superior reconstruction accuracy and plausibility on the Decaf [97] dataset, while generalizing well to difficult in-the-wild actions unseen in Decaf.

4.3.1 Quantitative Evaluations

Table 1: Comparison of hand-face interaction and deformation recovery on Decaf.

Methods Type 3D Reconstruction Error Physics Plausibility Metrics Running Time (per image; s) ↓
PVE‡ ↓ MPJPE ↓ PAMPJPE ↓ Col. Dist. ↓ Non. Col. ↑ Touchness ↑ F-Score ↑
Decaf [97] O 9.65 -- -- 83.6 96.6 89.6 19.59
Benchmark [68, 57] O 17.7 -- -- 19.3 68.4 16.40
PIXIE (hand+face) [23] O 26.3 -- -- 75.9 75.5 --
PIXIE (whole-body) [23] R 39.7 -- -- 51.8 67.6 \underline{\textbf{0.070}}
METRO* (hand+face) [63] R 11.8 15.4 11.9 80.7 54.8 0.103
DICE (Ours) R \underline{\textbf{8.32}} \underline{\textbf{9.95}} \underline{\textbf{7.27}} 66.6 79.9 \underline{72.7} 0.088

  • * parametric version. O and R denote optimization-based and regression-based methods, respectively. ‡ calculated after translating the center of the head to the origin. Bold and underline denote the overall best and the best among regression-based approaches, respectively. Note that our method operates at an interactive rate (20 fps; 0.049s per image) on an Nvidia 4090 GPU; here we report the runtime on a single A6000 GPU for a fair comparison.

Reconstruction Accuracy In Tab. 1, our method surpasses all baseline methods in reconstruction accuracy, achieving a $13.8\%$ reduction in per-vertex error compared to the current state of the art, Decaf (8.32 vs. 9.65). Note that our method is regression-based and allows inference at an interactive rate, while Decaf [97] relies on a cumbersome test-time optimization process that takes more than $200$x more time per image (19.59 s vs. 0.088 s in Tab. 1). Decaf also requires temporal information from successive frames, while our method uses only a single frame. Our method shows a $30\%$ reduction in reconstruction error compared to the modified METRO baseline, and up to a $79\%$ reduction compared to other end-to-end baselines.
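The relative reductions quoted against the regression-based baselines follow from the PVE column of Tab. 1. The short sketch below reproduces that arithmetic; the helper name is illustrative and the values are copied from the table.

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# PVE values taken from Tab. 1.
PVE = {"METRO* (hand+face)": 11.8, "PIXIE (whole-body)": 39.7, "DICE": 8.32}

if __name__ == "__main__":
    for name in ("METRO* (hand+face)", "PIXIE (whole-body)"):
        reduction = relative_reduction(PVE[name], PVE["DICE"])
        print(f"{name}: {reduction:.1f}% lower PVE with DICE")
```

The two results round to roughly 30% and 79%, matching the figures in the text.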

Plausibility In addition to high accuracy, our method achieves the highest overall physical plausibility (F-Score) among all regression-based methods. Note that Touchness and Non-Collision Ratio are complementary and are not meaningful when considered individually, while the F-Score measures the two jointly. Our method has a much lower interpenetration distance (Col. Dist.) than Benchmark and PIXIE (hand+face), which treat the hand and face separately and therefore generate implausible interactions. PIXIE (whole-body) and METRO* show lower collision distances but a much lower Touchness than our method, indicating that their reconstructed hands and faces often incorrectly appear not to be interacting. In contrast, our method shows a low collision distance together with a high Touchness, indicating plausible hand-face interaction reconstruction.

Contact Estimation In Tab. 2, DICE achieves superior contact estimation performance on the Decaf dataset, surpassing previous work [97] in F-Score for both face and hand contacts. Here, the F-Score combines precision and recall into a single measure; these two metrics are complementary and less meaningful when considered individually. See Fig. 3 for qualitative results.

Table 2: Comparison of contact estimation on Decaf.
Method F-score ↑ Precision ↑ Recall ↑ Accuracy ↑
Decaf (face) [97] 0.57 0.69 0.49 0.99
Decaf (hand) [97] 0.47 0.62 0.39 0.98
DICE (face) 0.61 0.64 0.57
DICE (hand) 0.50 0.55 0.45 0.98
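For reference, the four contact metrics in the table can be computed from per-vertex contact probabilities as sketched below. This is an illustration under assumptions: the function name is hypothetical, and the 0.5 threshold used to binarize probabilities is a placeholder rather than a value stated in the paper.

```python
def contact_scores(pred_probs, gt_labels, threshold=0.5):
    """Precision, recall, F-score and accuracy of binarized contact predictions.

    `pred_probs` are per-vertex contact probabilities, `gt_labels` are 0/1.
    The threshold is an assumed value for illustration.
    """
    tp = fp = fn = tn = 0
    for prob, label in zip(pred_probs, gt_labels):
        predicted = prob >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f = 2.0 * precision * recall / denom if denom else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f_score": f, "accuracy": accuracy}
```

Accuracy counts true negatives, so it can stay high whenever most vertices are not in contact, whereas precision, recall, and F-score ignore true negatives.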

4.3.2 Qualitative Evaluations

As discussed in Sec. 3.2, the Decaf [97] dataset was collected in an indoor environment with a green screen, which does not reflect the complex environments where real-world hand-face interactions occur. Therefore, a model trained only on the Decaf dataset might generalize poorly to in-the-wild data. Fig. 4 supports this: our model shows superior generalization on in-the-wild data with unseen identities and poses. As shown in Fig. 3, our method faithfully reconstructs hand-face interaction and deformation and accurately labels the area of contact.

4.4 Ablation Study

In-the-wild data As shown in Tab. 3, adding weakly-supervised training with in-the-wild data improves all reconstruction error metrics (PVE*, MPJPE, PAMPJPE) while maintaining high plausibility (F-Score). This is because the limited pose and identity distribution of the Decaf training dataset may cause the model to overfit, and including in-the-wild images from outside the Decaf data distribution effectively improves the generalization capability of DICE.

Depth Supervision Although depth supervision is only applied to in-the-wild data, as shown in Tab. 3, removing it also significantly affects performance on the Decaf validation set. Without depth loss, wrong predictions in depth are not penalized for in-the-wild data, introducing noise in the training process, and resulting in erroneous depth predictions in the Decaf dataset. As shown in Appendix Fig. 7, the absence of depth supervision introduces ambiguity in the z-direction, resulting in artifacts such as self-collision.

Adversarial Prior The adversarial prior incorporates a diverse but realistic pose and shape distribution beyond Decaf [97], ensuring the realism of the regressed meshes while allowing for generalization. As shown in Tab. 3, introducing adversarial supervision improves both accuracy and physical plausibility.

Parameter Supervision Supervising parameters directly, in addition to the indirect supervision of parameters by the mesh losses, improves both plausibility and accuracy. This is because direct parameter supervision eliminates ambiguity, without which the network may resort to other parameter combinations that produce incorrect meshes similar in terms of Euclidean distance.

Intermediate Supervision Removing the constraint that $\mathbf{V}_{F}^{\prime},\mathbf{V}_{H}^{\prime}$ be rough meshes, and treating them only as feature maps, results in a substantial drop in accuracy and a slight drop in plausibility (F-Score). Also, note that there is a sharp increase in collision distance, which is attributed to the spatial inaccuracy of the final output mesh. This indicates that using an intermediate mesh feature instead of an ordinary feature map increases the spatial accuracy of the output meshes, which also benefits plausibility.

Network Design In Tab. 3, adopting the two-branch architecture, which separates deformation and interaction estimation from mesh vertices regression, improves both accuracy and plausibility.

4.5 Limitations and Future Work

While our method achieves SotA accuracy on the Decaf [97] dataset and generalizes well to unseen scenes and in-the-wild cases, it still fails when hand-face interactions are extremely challenging and involve severe occlusions (see Appendix D.2). Moreover, while our method effectively recovers hand and face meshes with visually plausible deformations, there remains room for improvement in deformation accuracy and physical plausibility. In the future, physics-based simulation [37, 56, 38, 31, 60, 41] could be used as a stronger prior to produce more physically accurate estimations. Although we found that using 500 in-the-wild images significantly improves the model’s generalization ability, scaling up to a larger amount of in-the-wild data, on the order of millions or billions, may further enhance performance, which we will study in future work.

Table 3: Ablation study of hand-face interaction and deformation recovery on Decaf. Bold denotes the best result.

Methods 3D Reconstruction Error Physics Plausibility Metrics
PVE* ↓ MPJPE ↓ PAMPJPE ↓ Col. Dist. ↓ Non. Col. ↑ Touchness ↑ F-Score ↑
DICE (single branch) 11.6 8.51 87.4 57.4 69.3
DICE (w.o. in-the-wild data) 8.93 7.50 74.6 71.9 73.3
DICE (w.o. $\mathcal{L}_{\text{depth}}$) 15.6 19.5 13.7 0.64 58.6
DICE (w.o. $\mathcal{L}_{\text{params}}$) 10.3 12.8 10.4 80.9 53.9 64.7
DICE (w.o. $\mathcal{L}_{\text{adv}}$) 10.4 60.7 69.8
DICE (Full) 8.32 9.95 7.27 66.6 79.9 72.7

5 Conclusion

In this work, we present DICE, the first end-to-end approach for reconstructing 3D hand and face interactions with deformations from monocular images. Our approach features a two-branch transformer structure, MeshNet and InteractionNet, to model global mesh geometry and local deformation fields, respectively. An inverse-kinematics model, IKNet, outputs the animatable parametric hand and face meshes. We also propose a novel weakly-supervised training pipeline that uses a small number of in-the-wild images, supervised with a depth prior and an adversarial loss that provides pose priors. Benefiting from our network design and training scheme, DICE demonstrates state-of-the-art accuracy and plausibility compared with all previous methods. Meanwhile, our method achieves a fast inference speed (20 fps), enabling downstream interactive applications. In addition to strong performance on the standard benchmark, DICE also achieves superior generalization performance on in-the-wild data.


Appendix A Implementation Details

A.1 CNN Backbone

The CNN backbone used in our framework is an HRNet-W64 [99], initialized with ImageNet-pretrained weights. The backbone weights are updated during training. We extract a $(49\times H)$-dim feature map from this network and upsample it to an $(N\times H)$-dim feature map, where $N=N_{h_{k}}+N_{f_{k}}+N_{h_{v}}+N_{f_{v}}$ is the total number of hand and head keypoints $N_{h_{k}},N_{f_{k}}$ and vertices $N_{h_{v}},N_{f_{v}}$. Then, we concatenate the keypoints and vertices of the head and hand mean pose as keypoint and vertex queries, resulting in a $((N+3)\times H)$-dim feature map. Following [63], we apply random masking to the keypoint and vertex queries at a rate of $30\%$.
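The random query masking mentioned above can be illustrated with a small pure-Python sketch. Several details are assumptions made for illustration only: the function name, masking exactly a fixed fraction of the queries, and overwriting masked rows with a constant placeholder value (the actual mask token may differ).

```python
import random


def mask_queries(queries, rate=0.3, mask_value=0.0, rng=None):
    """Return a masked copy of `queries` and the set of masked row indices.

    Exactly round(rate * len(queries)) rows are overwritten with
    `mask_value`; both choices are illustrative assumptions.
    """
    if rng is None:
        rng = random.Random()
    n_masked = round(rate * len(queries))
    masked_ids = set(rng.sample(range(len(queries)), n_masked))
    masked = [
        [mask_value] * len(q) if i in masked_ids else list(q)
        for i, q in enumerate(queries)
    ]
    return masked, masked_ids


if __name__ == "__main__":
    # Ten toy 2-dim queries; three of them get masked at a 30% rate.
    toy = [[1.0, 2.0] for _ in range(10)]
    masked, ids = mask_queries(toy, rng=random.Random(0))
    print(sorted(ids), masked)
```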

A.2 MeshNet and InteractionNet

Our MeshNet and InteractionNet have similar progressive downsampling transformer encoder structures; see Fig. 5 for an illustration. MeshNet has three component transformer encoders with decreasing feature dimensions. InteractionNet starts with a fully connected layer that downsamples the feature dimension, followed by two transformer encoders. Each transformer encoder has a Multi-Head Attention module consisting of 4 layers and 4 attention heads. In addition to head and hand mesh features, MeshNet also regresses head and hand keypoints, which are used only for supervision and not by any downstream components.

Figure 5: Structural details of the MeshNet and InteractionNet. (a) MeshNet; (b) InteractionNet; (c) Internal structure of a Transformer Encoder block.

A.3 IKNet

Our IKNets take in rough mesh features $\mathbf{V}_{F}^{\prime},\mathbf{V}_{H}^{\prime}$ and output the pose and shape parameters $(\theta,\beta)$, as well as the global rotation and translation $(R,T)$. Each IKNet is a Multi-Layer Perceptron (MLP) consisting of five MLP Blocks and a final fully connected layer. Each MLP Block contains a fully connected layer, followed by a batch normalization layer [42] and a ReLU activation layer. There are two skip connections: one connects the output of the first block to the input of the third block, and the other connects the output of the third block to the input of the final fully connected layer. See Fig. 6 for an illustration. The hand and head IKNets have the same structure, differing only in their input and output dimensions. The hidden dimension of both IKNets is 1024.

Figure 6: Structural details of the IKNet.

A.4 Training and Testing Details

To be consistent with the training setting of Decaf (confirmed by the authors of Decaf [97]), we train on all eight camera views of subjects S2, S4, S5, S7, and S8 in the training split of the Decaf dataset. For testing, we use only the front view (view 108) of subjects S1, S3, and S6 in the testing split. The low-, mid-, and high-resolution head meshes consist of $559$, $1675$, and $5023$ vertices, respectively. The low- and high-resolution hand meshes consist of $195$ and $778$ vertices, respectively. We use the mid-resolution head mesh and the high-resolution hand mesh as the inputs of the head and hand IKNets.
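The split described above can be written down as a small configuration sketch. The constant and function names are hypothetical, and the non-front view identifier used in the example is a placeholder, since only view 108 is named in the text.

```python
TRAIN_SUBJECTS = {"S2", "S4", "S5", "S7", "S8"}
TEST_SUBJECTS = {"S1", "S3", "S6"}
TEST_VIEW = 108  # front view


def assign_split(subject, view):
    """Return "train", "test", or None for a (subject, camera view) pair."""
    if subject in TRAIN_SUBJECTS:
        return "train"  # all eight camera views are used for training
    if subject in TEST_SUBJECTS and view == TEST_VIEW:
        return "test"
    return None  # test subject seen from a non-front view: unused


if __name__ == "__main__":
    print(assign_split("S2", 108))  # training subject, any view
    print(assign_split("S1", 108))  # test subject, front view
    print(assign_split("S1", 0))    # placeholder id for a non-front view
```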

Appendix B More Qualitative Comparisons

We demonstrate qualitatively the effect of the absence of the depth loss in Fig. 7. When trained without depth loss, the network is only supervised with 2D information on in-the-wild data, without any constraints in the z-direction. As a result, artifacts such as self-penetration frequently occur in this case. The introduction of depth loss eliminates this ambiguity, allowing the correct relative positioning of hand and face.

Figure 7: Qualitative demonstration of the effects of the depth loss. The model generalizes poorly in the z-direction when trained without depth supervision.

Appendix C Additional Details on Losses

Here, we provide the details of the mesh losses and the interaction losses. The details of the adversarial loss and the depth loss are already mentioned in the main paper.

C.1 Mesh losses

The mesh loss $\mathcal{L}_{\text{mesh}}$ consists of four components:

$$\mathcal{L}_{\text{mesh}}=\mathcal{L}_{\text{reproj}}+4\mathcal{L}_{\text{vert}}+2\mathcal{L}_{\text{key}}+2\mathcal{L}_{\text{params}}. \tag{6}$$

Vertices Loss An $L_{1}$ loss is applied to the predicted rough 3D face and hand vertices $\mathbf{V}_{f}^{\prime}$, $\mathbf{V}_{h}^{\prime}$, the FLAME-regressed undeformed 3D face vertices $\mathbf{V}_{f}^{*}$, and the MANO-regressed 3D hand vertices $\mathbf{V}_{h}$, against the ground-truth 3D undeformed face vertices $\mathbf{\hat{V}}_{f}$ and 3D hand vertices $\mathbf{\hat{V}}_{h}$:

$$\mathcal{L}_{\text{vert}}=\lambda_{h}(\mu_{\text{nonpara}}\|\mathbf{V}_{h}^{\prime}-\mathbf{\hat{V}}_{h}\|_{1}+\|\mathbf{V}_{h}-\mathbf{\hat{V}}_{h}\|_{1})+\lambda_{f}(\mu_{\text{nonpara}}\|\mathbf{V}_{f}^{\prime}-\mathbf{\hat{V}}_{f}\|_{1}+\|\mathbf{V}_{f}^{*}-\mathbf{\hat{V}}_{f}\|_{1}), \tag{7}$$

where $\lambda_{h},\lambda_{f}$ are empirically set to $3$ and $1$, respectively, and $\mu_{\text{nonpara}}$ is set to $4$ to emphasize the supervision on the more complex non-parametric mesh features.
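A minimal pure-Python sketch of the vertices loss is given below, using the weights stated above. Following the prose description, every face term is compared against the ground-truth face vertices and every hand term against the ground-truth hand vertices. The function names are illustrative, and the $L_{1}$ norm is taken as a plain sum of absolute coordinate differences, which is an assumption (an implementation may average instead).

```python
LAMBDA_H, LAMBDA_F, MU_NONPARA = 3.0, 1.0, 4.0  # weights stated for Eq. (7)


def l1(a, b):
    """Sum of absolute coordinate differences between two vertex lists."""
    return sum(abs(x - y) for p, q in zip(a, b) for x, y in zip(p, q))


def vertices_loss(v_h_rough, v_h_mano, v_h_gt, v_f_rough, v_f_flame, v_f_gt):
    """Weighted L1 vertices loss over rough and parametric meshes."""
    hand = MU_NONPARA * l1(v_h_rough, v_h_gt) + l1(v_h_mano, v_h_gt)
    face = MU_NONPARA * l1(v_f_rough, v_f_gt) + l1(v_f_flame, v_f_gt)
    return LAMBDA_H * hand + LAMBDA_F * face


if __name__ == "__main__":
    # One toy vertex per mesh, only to show the call.
    gt = [(0.0, 0.0, 0.0)]
    print(vertices_loss([(1.0, 0.0, 0.0)], [(0.0, 2.0, 0.0)], gt,
                        [(0.0, 0.0, 1.0)], gt, gt))
```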
Keypoints Loss We use an $L_{1}$ loss for the predicted rough 3D face and hand keypoints $\mathbf{K}_{f}^{\prime}$, $\mathbf{K}_{h}^{\prime}$, the 3D face and hand keypoints extracted from the rough meshes $\mathbf{K}_{f_{\text{mesh}}},\mathbf{K}_{h_{\text{mesh}}}$, the FLAME-regressed 3D face keypoints $\mathbf{K}_{f}$, and the MANO-regressed 3D hand keypoints $\mathbf{K}_{h}$, against the ground-truth 3D undeformed face keypoints $\mathbf{\hat{K}}_{f}$ and 3D hand keypoints $\mathbf{\hat{K}}_{h}$:

\begin{align}
\mathcal{L}_{\text{key}}={}&\mu_{\text{nonpara}}(\|\mathbf{K}_{h}^{\prime}-\mathbf{\hat{K}}_{h}\|_{1}+\|\mathbf{K}_{h_{\text{mesh}}}-\mathbf{\hat{K}}_{h}\|_{1}+\|\mathbf{K}_{f}^{\prime}-\mathbf{\hat{K}}_{f}\|_{1}+\|\mathbf{K}_{f_{\text{mesh}}}-\mathbf{\hat{K}}_{f}\|_{1}) \tag{8}\\
&+\|\mathbf{K}_{f}-\mathbf{\hat{K}}_{f}\|_{1}+\|\mathbf{K}_{h}-\mathbf{\hat{K}}_{h}\|_{1}, \tag{9}
\end{align}

where $\mu_{\text{nonpara}}$ is empirically set to $4$ to put more weight on the non-parametric mesh with high degrees of freedom.
Reprojection Loss An $L_{1}$ loss is applied to the reprojections of the rough 3D face and hand keypoints $\mathbf{K}_{f}^{\prime}$, $\mathbf{K}_{h}^{\prime}$, the 3D face and hand keypoints extracted from the rough meshes $\mathbf{K}_{f_{\text{mesh}}},\mathbf{K}_{h_{\text{mesh}}}$, the FLAME-regressed 3D face keypoints $\mathbf{K}_{f}$, and the MANO-regressed 3D hand keypoints $\mathbf{K}_{h}$, against the ground-truth face and hand 2D keypoints $\mathbf{\hat{K}}_{f_{\text{2D}}},\mathbf{\hat{K}}_{h_{\text{2D}}}$:

\begin{align}
\mathcal{L}_{\text{reproj}}={}&\lambda_{h}(\|\Pi(\mathbf{K}_{h}^{\prime})-\mathbf{\hat{K}}_{h_{\text{2D}}}\|_{1}+\|\Pi(\mathbf{K}_{h_{\text{mesh}}})-\mathbf{\hat{K}}_{h_{\text{2D}}}\|_{1}+\|\Pi(\mathbf{K}_{h})-\mathbf{\hat{K}}_{h_{\text{2D}}}\|_{1}) \tag{10}\\
&+\lambda_{f}(\|\Pi(\mathbf{K}_{f}^{\prime})-\mathbf{\hat{K}}_{f_{\text{2D}}}\|_{1}+\|\Pi(\mathbf{K}_{f_{\text{mesh}}})-\mathbf{\hat{K}}_{f_{\text{2D}}}\|_{1}+\|\Pi(\mathbf{K}_{f})-\mathbf{\hat{K}}_{f_{\text{2D}}}\|_{1}), \tag{11}
\end{align}

where $\Pi$ is the learned camera projection function, and $\lambda_{h},\lambda_{f}$ are set to $4$ and $1$, respectively.

Parameter loss We apply L1subscript𝐿1L_{1}italic_L start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT loss on the regressed hand and face pose, shape, and facial expression parameters against their respective ground truths.

$$\mathcal{L}_{\text{face-params}} = \left(\|\beta_{\text{f}} - \hat{\beta}_{\text{f}}\|_{1} + \|\theta_{\text{f-exp}} - \hat{\theta}_{\text{f-exp}}\|_{1} + \|\theta_{\text{f-pose}} - \hat{\theta}_{\text{f-pose}}\|_{1}\right) / 3 \tag{12}$$
$$\mathcal{L}_{\text{hand-params}} = \left(\|\beta_{\text{h}} - \hat{\beta}_{\text{h}}\|_{1} + \|\theta_{\text{h}} - \hat{\theta}_{\text{h}}\|_{1}\right) / 2 \tag{13}$$
$$\mathcal{L}_{\text{params}} = \mathcal{L}_{\text{face-params}} + \mathcal{L}_{\text{hand-params}} \tag{14}$$
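As a rough illustration of Eqs. (12)-(14), the sketch below averages the per-group $L_1$ terms for the face and the hand and then sums the two. The dictionary layout and the key names (`shape`, `expression`, `pose`) are hypothetical choices made for this example, and plain Python lists stand in for the tensors a training pipeline would use.

```python
def l1_norm(pred, target):
    """L1 norm of the difference between two equal-length parameter vectors."""
    if len(pred) != len(target):
        raise ValueError("parameter vectors must have the same length")
    return sum(abs(p - t) for p, t in zip(pred, target))


def parameter_loss(face_pred, face_gt, hand_pred, hand_gt):
    """Eqs. (12)-(14): mean of per-group L1 terms for face and hand, summed.

    Each argument is a dict mapping a group name to a list of floats.
    This layout is an assumption of the sketch, not taken from the paper.
    """
    face_keys = ("shape", "expression", "pose")  # beta_f, theta_f-exp, theta_f-pose
    hand_keys = ("shape", "pose")                # beta_h, theta_h
    face = sum(l1_norm(face_pred[k], face_gt[k]) for k in face_keys) / 3
    hand = sum(l1_norm(hand_pred[k], hand_gt[k]) for k in hand_keys) / 2
    return face + hand
```

Because each group is reduced with an unnormalized $L_1$ norm, groups with more parameters contribute proportionally more before the division by the number of groups.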

C.2 Interaction losses

The interaction loss $\mathcal{L}_{\text{interaction}}$ consists of four components:

$$\mathcal{L}_{\text{interaction}} = 0.2\,\mathcal{L}_{\text{touch}} + 0.6\,\mathcal{L}_{\text{contact}} + \mathcal{L}_{\text{collision}} + 6\,\mathcal{L}_{\text{deform}}. \tag{15}$$

Deformation loss Due to human anatomy, some face vertices deform more easily than others. We therefore weight each vertex adaptively and use a squared loss, which penalizes large deformation errors more strongly. We also add a regularization term that penalizes extremely large predicted deformations.

$$\mathcal{L}_{\text{deform}} = \sum_{i \in \mathcal{I}} \left(1 + \mu \|\hat{d_{i}}\|_{2}\right) \|\hat{d_{i}} - d_{i}\|_{2}^{2} + \lambda \sum_{i \in \mathcal{L}} \|d_{i}\|, \tag{16}$$

where $\mathcal{I}$ is the set of face vertex indices, $d_{i}$ and $\hat{d_{i}}$ are the predicted and ground-truth deformation vectors for index $i$, and $\mathcal{L} = \{i \in \mathcal{I} : \|d_{i}\|_{2} > 3\,\text{cm}\}$ is the set of vertices with large deformations. $\mu$ and $\lambda$ are empirically set to $5000$ and $100$, respectively.
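A minimal sketch of Eq. (16) follows, assuming deformation vectors are given in metres (so the 3 cm threshold becomes `0.03`) and that the unsubscripted norm in the regularizer is the Euclidean norm. Both are assumptions of this example, and the function and argument names are illustrative.

```python
import math


def _norm(v):
    """Euclidean norm of a 3D vector given as a sequence of floats."""
    return math.sqrt(sum(c * c for c in v))


def deformation_loss(pred, gt, mu=5000.0, lam=100.0, large=0.03):
    """Eq. (16): adaptively weighted squared error plus a large-deformation penalty.

    pred, gt: per-vertex deformation vectors (d_i and d_i-hat), assumed in metres.
    large:    threshold defining the set of large deformations (3 cm, assumed metres).
    """
    total = 0.0
    for d, d_hat in zip(pred, gt):
        # Vertices with larger ground-truth deformation get a larger weight.
        weight = 1.0 + mu * _norm(d_hat)
        sq_err = sum((a - b) ** 2 for a, b in zip(d_hat, d))
        total += weight * sq_err
        # Regularizer: applies only to predictions above the threshold.
        pred_norm = _norm(d)
        if pred_norm > large:
            total += lam * pred_norm
    return total
```

With $\mu = 5000$, a vertex whose ground-truth deformation is 1 cm receives a weight of $1 + 5000 \cdot 0.01 = 51$ under the metre assumption, so errors on visibly deformed vertices dominate the sum.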

Touch loss Let $\mathbf{V}_{F_{C}}$ and $\mathbf{V}_{H_{C}}$ denote the sets of face and hand vertices that are predicted by the model to have contact probability greater than

$$\mathcal{L}_{\text{touch}} = \text{CD}(\mathbf{V}_{F_{C}}, \mathbf{V}_{H_{C}}) + \text{CD}(\mathbf{V}_{H_{C}}, \mathbf{V}_{F_{C}}), \tag{17}$$

where $\text{CD}(X, Y)$ denotes the one-directional Chamfer Distance (CD), i.e., the mean distance from each point in $X$ to its closest point in $Y$.
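The symmetric form of Eq. (17) can be sketched as follows. The contact-probability threshold is a required argument because its value is not stated in the text above. The use of unsquared Euclidean distance and the choice to return zero when either contact set is empty are assumptions of this sketch.

```python
import math


def _dist(p, q):
    """Euclidean distance between two points given as sequences of floats."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def one_way_chamfer(xs, ys):
    """CD(X, Y): mean distance from each point in X to its closest point in Y."""
    if not xs or not ys:
        return 0.0  # assumption: an empty set contributes no loss
    return sum(min(_dist(x, y) for y in ys) for x in xs) / len(xs)


def touch_loss(face_verts, face_prob, hand_verts, hand_prob, threshold):
    """Eq. (17): symmetric Chamfer term between predicted contact vertices.

    threshold: contact-probability cutoff; its value must be supplied by the caller.
    """
    face_c = [v for v, p in zip(face_verts, face_prob) if p > threshold]
    hand_c = [v for v, p in zip(hand_verts, hand_prob) if p > threshold]
    return one_way_chamfer(face_c, hand_c) + one_way_chamfer(hand_c, face_c)
```

This brute-force nearest-neighbour search is quadratic in the number of contact vertices. A batched tensor implementation would compute the pairwise distance matrix once and take row and column minima.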

Collision loss Let $\mathbf{V}_{H_{\text{Col}}}$ denote the set of hand vertices that penetrate the face surface, and let $\mathbf{V}_{F}$ and $\mathbf{D}_{F}$ denote the predicted face mesh vertices and deformations.

$$\mathcal{L}_{\text{collision}} = \text{CD}(\mathbf{V}_{H_{\text{Col}}}, \mathbf{V}_{F} - \mathbf{D}_{F}). \tag{18}$$
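Eq. (18) can be sketched along the same lines. The example takes the penetrating hand vertices as given, since the inside/outside test that selects them is not described in this section. It also assumes unsquared Euclidean distance and a zero loss when no vertex penetrates. All names are illustrative.

```python
import math


def _dist(p, q):
    """Euclidean distance between two points given as sequences of floats."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def one_way_chamfer(xs, ys):
    """CD(X, Y): mean distance from each point in X to its closest point in Y."""
    if not xs or not ys:
        return 0.0  # assumption: no penetrating vertices means no penalty
    return sum(min(_dist(x, y) for y in ys) for x in xs) / len(xs)


def collision_loss(penetrating_hand_verts, face_verts, face_deforms):
    """Eq. (18): CD between penetrating hand vertices and V_F - D_F.

    penetrating_hand_verts: hand vertices already identified as inside the face.
    face_verts, face_deforms: predicted face vertices and per-vertex deformations.
    """
    # Subtract the predicted deformation from each face vertex, as in Eq. (18).
    targets = [
        tuple(v - d for v, d in zip(vert, deform))
        for vert, deform in zip(face_verts, face_deforms)
    ]
    return one_way_chamfer(penetrating_hand_verts, targets)
```

Only the hand-to-face direction is used, so the term vanishes whenever the set of penetrating vertices is empty.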

Contact loss Let $\mathbf{C}_{H}$ and $\mathbf{C}_{F}$ denote the predicted hand and face contact probabilities, and let $\hat{\mathbf{C}}_{H}$ and $\hat{\mathbf{C}}_{F}$ denote the ground-truth contact labels.

$$\mathcal{L}_{\text{contact}} = \text{BCE}(\mathbf{C}_{H}, \hat{\mathbf{C}}_{H}) + \text{BCE}(\mathbf{C}_{F}, \hat{\mathbf{C}}_{F}), \tag{19}$$

where BCE denotes the binary cross-entropy loss.
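For concreteness, Eq. (19) can be written out as below. The mean reduction over vertices and the clamping constant `eps` are assumptions of this sketch, since the section does not specify either.

```python
import math


def bce(probs, labels, eps=1e-7):
    """Binary cross-entropy averaged over vertices (mean reduction assumed)."""
    if not probs:
        return 0.0
    total = 0.0
    for p, y in zip(probs, labels):
        # Clamp to avoid log(0) for saturated predictions.
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(probs)


def contact_loss(hand_prob, hand_label, face_prob, face_label):
    """Eq. (19): sum of the hand and face contact BCE terms."""
    return bce(hand_prob, hand_label) + bce(face_prob, face_label)
```

A maximally uncertain prediction of 0.5 on every vertex gives $\ln 2$ per term, which is a convenient sanity check for an implementation.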

Appendix D More Discussions

D.1 Performance under Challenging Occlusion.

Figure 8: Examples of failed keypoint estimation under large self-occlusion. (a) Input image; (b) inaccurate keypoint estimation by the same keypoint estimators used in Decaf [68, 3]; (c) hand-face interaction reconstructed by our method; (d) hand-face interaction reconstructed by Decaf.

As seen in Fig. 8, our end-to-end DICE method is robust under challenging self-occlusion cases, such as the hand covering more than half of the face. On the other hand, Decaf [97], which requires an initial keypoint prediction for test-time optimization, performs poorly in this situation.

D.2 Failure Cases

In Fig. 9, we show failure cases of our method. As shown in Fig. 9 (a), when the interaction between the hand and face is complex, such as when a cleaning sponge is present, the accuracy of the hand mesh recovery drops. In addition, as shown in Fig. 9 (b), when the face completely occludes the hand, a highly challenging scenario unseen in the training data, our model cannot faithfully reconstruct the hand position.

Figure 9: Samples of failure cases.