
Towards Geometric-Photometric Joint Alignment for Facial Mesh Registration

Xizhi Wang
Zhejiang Lab
Yaxiong Wang
Hefei University of Technology
Mengjian Li
Zhejiang Lab

Abstract

This paper presents a Geometric-Photometric Joint Alignment (GPJA) method for accurately aligning human expressions by combining geometric and photometric information. Common practices for registering human heads typically involve aligning landmarks with facial template meshes using geometry processing approaches, but often overlook photometric consistency. GPJA overcomes this limitation by leveraging differentiable rendering to align vertices with target expressions, achieving joint alignment in geometry and photometric appearance automatically, without the need for semantic annotation or aligned meshes for training. It features a holistic rendering alignment strategy and a multiscale regularized optimization for robust and fast convergence. The method utilizes derivatives at vertex positions for supervision and employs a gradient-based algorithm that guarantees smoothness and avoids topological defects during the geometry evolution. Experimental results demonstrate faithful alignment under various expressions, surpassing conventional ICP-based methods and a state-of-the-art deep learning based method. In practice, our method enhances the efficiency of obtaining topology-consistent face models from multi-view stereo facial scanning.

1 Introduction

Professional studios in industry and academia commonly use synchronized multi-view stereo setups for facial scanning [17, 36, 44], ensuring high-fidelity results in controlled settings. These setups aim to generate topology-consistent meshes for different subjects with various facial expressions. Conventional pipelines [13] typically construct raw scans from multi-view images and then apply manual processes such as marker point tracking, clean-up, or key-framing [4], which are labor-intensive and time-consuming, limiting their application in the film, gaming, and AR/VR industries. To achieve automatic registration, geometry-based methods have been widely employed [1, 21]. However, these methods primarily focus on geometric alignment and neglect photometric consistency. To remedy this issue, this paper aims to achieve joint alignment in terms of geometry and photometric appearance. To this end, two challenges need to be addressed.

The first challenge is to establish a proper deformation field to guide the alignment process, especially for challenging areas such as the mouth and eyes. Previous attempts have been made to construct correspondences by landmarks or optical flow to aid the photometric alignment [13, 1]. However, the offset vectors obtained through these approaches in the 2D image space are insufficient for guiding deformation in the 3D world. Gafni et al. [14] employ an implicit volumetric representation to combine shape and appearance recovery for realistic rendering, which lacks explicit geometry constraints. In response to this challenge, we propose a registration framework based on differentiable rendering [20, 31] to generate topology-consistent facial meshes from multi-view images. In particular, our approach includes a Holistic Rendering Alignment (HRA) which incorporates constraints from color, depth, and surface normals, facilitating alignment through automatic differentiation without explicit correspondence computation.

Figure 1: Given multi-view images and (a) the textured template mesh, we propose a novel method GPJA based on differentiable rendering to achieve geometric and photometric alignment jointly for facial meshes. (b) The aligned meshes are rendered with the shared texture map as the template in (a). (c) The zoomed renderings of eyes and mouths demonstrate photometric alignment with the reference images.

The second challenge for facial mesh registration lies in generating faithful output meshes while retaining the topological structure. Aligning facial expressions involves large-step geometry deformation, and is thus prone to topological artifacts. To address this problem, we resort to a Multiscale Regularized Optimization that combines a modified gradient descent algorithm [30] with a coarse-to-fine remeshing scheme. Starting with the coarsest template, the mesh is tessellated periodically while the vertices are updated by an efficient regularized geometry optimization driven by the constraints collected from holistic rendering alignment. Our multiscale regularized optimization ensures smoothness and fast convergence without requiring a specific amount of training data.

We validate our method through experiments on six subjects from the FaceScape [43] dataset, covering diverse facial expressions. The joint alignment is examined by both geometric and image metrics, demonstrating the effectiveness of our approach. The aligned meshes produced by our method are of high quality, free from topological errors, and accurately warped even in challenging regions like mouths and eyes.

Our contributions are summarized as follows:

• A new method named GPJA achieving joint alignment in geometry and photometric appearances for facial meshes, without any semantic annotation like facial landmarks.

• A holistic rendering alignment module based on differentiable rendering that effectively generates the deformation field for joint alignment.

• A multiscale regularized optimization that produces high-quality aligned meshes using an efficient gradient-based algorithm.

2 Related Work

Our research centers on the registration of facial meshes for topology-consistent geometry. This section provides a literature review relevant to our study.

Geometry Processing Methods. Non-rigid registration is a well-established technique in geometry processing for warping a template mesh to raw scans [40, 37, 3]. The strong geometric prior of the template model enhances local shape matching and enables large-step deformation, making it suitable for facial mesh processing. The Iterative Closest Point (ICP) algorithm is a commonly used framework [1, 37, 22] for registration. With a template based on 3D Morphable Models [43, 16] as initialization, ICP minimizes the error between landmarks on the template and the scans, resulting in a rough alignment. Then, a fine-tuning stage involves searching for valid correspondences in the spatial neighborhood, and warping the template leveraging data fidelity and smoothness terms. Previous work has explored regularization terms [42, 32] and correspondences [41] in ICP-based algorithms. However, these methods [43, 9, 16] have limitations in achieving photometric consistency as they rely solely on geometric information.

To address photometric alignment, researchers have explored incorporating optical flow [13, 33, 8] into the fine-tuning stage of facial mesh registration. However, optical flow alone is inadequate for handling significant differences between the source template and target scans, as well as occlusion changes around the eyes and mouth. Another challenge in facial mesh registration is maintaining smooth contours in facial features [5], which is difficult due to occlusions and color changes. Previous approaches have used user-guided methods [12] or contour extraction [15] to address these challenges, but they either involve lengthy workflows or specific treatments.

Deep Learning Based Methods. Recent advancements in deep learning approaches for generating topology-consistent meshes include ToFu [23], which predicts probabilistic distributions of individual vertices of a template face mesh to reconstruct registered face geometry. TEMPEH [6] enhances ToFu with a transformer-based architecture, while NPHM [16] models head geometry using a neural field representation that parameterizes shape and expressions in disentangled latent spaces. While these methods prioritize geometric alignment, they do not guarantee rigid photometric consistency. ReFA [25] introduces a recurrent network operating in the texture space for predicting positions and texture maps. Although these deep learning methods represent progress in facial mesh registration, they require a substantial amount of registered data processed with classical ICP-based algorithms.

In addition to mesh-based representations, implicit volumetric representations have gained popularity in reconstruction. Several studies have extended NeRF [27] for dynamic face reconstruction [34, 45, 2]. The pipeline typically begins with explicit parametric models, followed by estimating a deformation field. Finally, a volumetric renderer is used to generate densities and colors. However, NeRF-style methods lack supervision for aligning explicit-represented geometry and generally do not produce production-ready geometries despite decent rendering results [25].

3 Preliminaries

Before discussing the details of our methodology, we give a brief introduction to differentiable rendering.

Given a 3D scene containing mesh-based geometries, lights, materials, textures, cameras, etc., a renderer that synthesizes a 2D image $\boldsymbol{I}$ at each screen pixel $\boldsymbol{p}(x,y)$ can be formulated as:

\boldsymbol{I}(\boldsymbol{p}(x,y))=F(\boldsymbol{x};\Theta),

(1)

where the function $F(\cdot)$ represents the rendering process, encompassing various computations such as shading, interpolation, projection, and antialiasing. The output of this function can be RGB colors, normals, depths, label images, etc. In our scenario, the parameters to be optimized are the positions of the mesh vertices, denoted by $\boldsymbol{x}\in\mathbb{R}^{n\times 3}$. $\Theta$ symbolizes a set of scene parameters known in advance, including camera poses, lighting, texture color, and other relevant factors.

Differentiable rendering augments renderers by additionally providing derivatives with respect to certain scene parameters, and is an emerging tool for inverse problems [31, 28]. The objective of inverse rendering is to recover specific scene parameters through gradient-based optimization of a scalar loss function $\mathcal{L}$, which is usually defined as the sum of pixel-wise differences between the rendered images and the reference images $\boldsymbol{I}^{\textit{{r}}}\in\mathbb{R}^{w\times h\times c}$ across $v$ views. In this work, we adopt the $L_{1}$-norm for loss functions:

\mathcal{L}\left(\boldsymbol{x}\mid\Theta,\boldsymbol{I}^{\textit{{r}}}\right)=\sum_{j}^{v}\left|F(\boldsymbol{x}\mid\Theta_{j})-\boldsymbol{I}_{j}^{\textit{{r}}}\right|.

(2)
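As a minimal illustration of Eq. 2, the sketch below evaluates the multi-view $L_{1}$ loss on toy images stored as nested lists, together with its subgradient with respect to each rendered pixel, which is the quantity a differentiable renderer would propagate back to the vertex positions. The function name and the toy data are hypothetical and are not part of our implementation.

```python
def l1_loss_and_subgradient(rendered_views, reference_views):
    """Return the Eq. 2 loss and its subgradient w.r.t. every rendered pixel."""
    total = 0.0
    grads = []
    for rendered, reference in zip(rendered_views, reference_views):
        view_grad = []
        for row_r, row_ref in zip(rendered, reference):
            grad_row = []
            for p, q in zip(row_r, row_ref):
                diff = p - q
                total += abs(diff)
                # d|diff|/dp is sign(diff); 0 is a valid subgradient at diff == 0
                grad_row.append((diff > 0) - (diff < 0))
            view_grad.append(grad_row)
        grads.append(view_grad)
    return total, grads


# Two toy 2x2 single-channel "views" (illustrative values only)
rendered = [[[0.5, 0.5], [0.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
reference = [[[0.5, 0.0], [0.0, 1.0]], [[1.0, 0.5], [1.0, 0.0]]]
loss, grads = l1_loss_and_subgradient(rendered, reference)
print(loss, grads)
```

In an actual pipeline the chain rule through $F$ is handled by automatic differentiation, so only the loss needs to be written explicitly.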

In our setting of joint alignment registration, we use the derivative $\frac{\partial\mathcal{L}}{\partial\boldsymbol{x}}$ to guide the template mesh in fitting the target expressions while maintaining topological consistency.

4 Joint Alignment of Facial Meshes

Figure 2 depicts the pipeline of our method. Starting with the provided textured template mesh and raw scans, each iteration first computes the deformation from the holistic rendering alignment and then optimizes the vertices via the multiscale regularized optimization.

4.1 Holistic Rendering Alignment

Motivation. Using the provided template mesh and raw scans, non-rigid 3D registration calculates the deformation parameterized on the vertices of the template mesh to align with the target scan. While the geometric alignment has been well studied [10], previous photometric alignment attempts mainly learn 2D-constrained deformations [33, 13] expressed in the image space, which are prone to false correspondences and ambiguities when elevating to the 3D space. On the other hand, we note that differentiable rendering is capable of providing automatically computed derivatives $\frac{\partial\mathcal{L}}{\partial\boldsymbol{x}}$ as guidance to deform the template mesh ${\boldsymbol{T}}$ into joint alignment, avoiding correspondence errors from semantic facial features [1] or offset vectors by optical flow [33]. Furthermore, differentiable rendering retrieves the deformation directly in the 3D space, addressing occlusion issues in eye and mouth regions.

With the above consideration, we propose a Holistic Rendering Alignment (HRA) mechanism, incorporating multiple cues with the aid of differentiable rendering. HRA collects constraints from three different aspects, i.e., color, depth, and surface normals:

\mathcal{L}=\mathcal{L}_{color}+\mathcal{L}_{depth}+\mathcal{L}_{normal}.

(3)

Color Constraint. $\mathcal{L}_{color}$ seeks to impose constraints for photometric alignment by comparing the rendered image with the observed multi-view color images. However, the deformation on the inner lip contours is occasionally disturbed by occlusion changes. To address this issue, we deliberately exclude the interior mouth from the color constraint by a masking strategy.

As illustrated in Fig. 3, a binary mask image $\boldsymbol{B}\in\left\{0,1\right\}^{t\times t\times 3}$ is manually created in accordance with the color texture $\boldsymbol{C}_{\boldsymbol{T}}\in\mathbb{R}^{t\times t\times 3}$ of the given template $\boldsymbol{T}$. The mask image labels the interior mouth region, which corresponds to the mouth socket of $\boldsymbol{T}$.

The color constraint is defined as the summation of the element-wise absolute difference between the rendered image and the reference image, weighted by the mask.

\mathcal{L}_{color}\left(\boldsymbol{x}\right)=\sum_{j}^{v}|\left(F_{S}\left(\boldsymbol{x}\mid\boldsymbol{P}_{j},\boldsymbol{S},\boldsymbol{C}_{\boldsymbol{T}}\right)-\boldsymbol{I}_{j}^{\textit{{r}}}\right)\odot F_{S}\left(\boldsymbol{x}\mid\boldsymbol{P}_{j},\boldsymbol{B}\right)|,

(4)

where the shading function $F_{S}:\mathbb{R}^{n\times 3}\rightarrow\mathbb{R}^{w\times h\times 3}$ renders diffuse objects under the lighting estimated from the capture setup, expressed in spherical harmonics form $\boldsymbol{S}\in\mathbb{R}^{9\times 3}$ [35], $\boldsymbol{P}_{j}\in\mathbb{PL}\left(3\right)$ denotes the projection matrix of the $j$-th camera, and $\odot$ denotes element-wise multiplication. By synthesizing a binary image using $F_{S}(\boldsymbol{x}|\boldsymbol{P}_{j},\boldsymbol{B})$, in which the interior mouth is assigned zeros and the rest ones, Eq. 4 masks out the interior mouth from the color constraint. Experimental results confirm the effectiveness of this mask operation.
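The effect of the mask in Eq. 4 can be illustrated for a single view with a small sketch. The images below are toy single-channel grids, and the function name is hypothetical; the point is only that pixels whose rendered mask value is zero contribute nothing to the color constraint.

```python
def masked_l1(rendered, reference, mask):
    """Sum of |(rendered - reference) * mask| over all pixels of one view."""
    total = 0.0
    for row_r, row_ref, row_m in zip(rendered, reference, mask):
        for p, q, m in zip(row_r, row_ref, row_m):
            total += abs((p - q) * m)
    return total


rendered = [[0.25, 1.0], [0.5, 0.0]]
reference = [[0.25, 0.0], [0.0, 0.0]]
# 0 marks the interior mouth (excluded), 1 marks everything else
mouth_mask = [[1, 0], [1, 1]]
no_mask = [[1, 1], [1, 1]]

masked = masked_l1(rendered, reference, mouth_mask)
unmasked = masked_l1(rendered, reference, no_mask)
print(masked, unmasked)
```

Here the large residual at the masked pixel is dropped, so it cannot pull the vertices around the inner lip contour.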

Depth Constraint. $\mathcal{L}_{depth}$ aims to achieve geometric alignment. The color images observed in real-world scenes exhibit inconsistencies due to variations in shading and image formation across views and expressions. Therefore, the derivatives of the color constraint inevitably introduce bias and noise, which disturb accurate alignment. By including the depth term in HRA, we provide strong supervision for preserving geometric fidelity during registration, ensuring robust outputs.

The depth constraint measures the depth disparity between the deformed template and the target scan:

\mathcal{L}_{depth}\left(\boldsymbol{x}\right)=\sum_{j}^{v}|\left(F_{D}\left(\boldsymbol{x}\mid\boldsymbol{P}_{j}\right)-F_{D}\left(\boldsymbol{\tilde{x}}\mid\boldsymbol{P}_{j}\right)\right)|,

(5)

where $F_{D}:\mathbb{R}^{n\times 3}\rightarrow\mathbb{R}^{w\times h}$ represents the depth rendering operation [18] based on the perspective projection function, and $\boldsymbol{\tilde{x}}\in\mathbb{R}^{m\times 3}$ denotes vertex coordinates of the scan with $m$ vertices.
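The perspective projection underlying $F_{D}$ can be sketched for a single vertex as follows. This is not a rasterizer: it only maps one 3D point to a pixel coordinate and a depth value. The sketch assumes a $3\times 4$ pinhole matrix of the form $K[R\mid t]$ whose intrinsic matrix has last row $(0,0,1)$, so that the third homogeneous coordinate equals the depth along the optical axis; the matrix values and function name are illustrative.

```python
def project_vertex(P, vertex):
    """Apply a 3x4 projection matrix to a 3D point.

    Returns ((u, v), depth), where depth is the third homogeneous coordinate.
    """
    X = list(vertex) + [1.0]
    h = [sum(P[r][c] * X[c] for c in range(4)) for r in range(3)]
    if h[2] <= 0.0:
        raise ValueError("point is not in front of the camera")
    return (h[0] / h[2], h[1] / h[2]), h[2]


# Toy camera: focal length 100 px, principal point (50, 50), identity pose
P = [
    [100.0, 0.0, 50.0, 0.0],
    [0.0, 100.0, 50.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
pixel, depth = project_vertex(P, (0.5, -0.5, 2.0))
print(pixel, depth)
```

A depth map is obtained by rasterizing the projected triangles and interpolating this depth per pixel, which is the part delegated to the renderer.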

Normal Constraint. $\mathcal{L}_{normal}$ assists with fidelity preservation. While the color and depth terms establish guidance for overall alignment, discrepancies in color tones and texture-less areas can cause artifacts on aligned meshes. The normal constraint compensates for these issues, leading to improved alignment and sharper details with fewer vertices.

Specifically, the normal constraint penalizes the disparity of surface normals between the deformed template and the target scan:

\mathcal{L}_{normal}\left(\boldsymbol{x}\right)=\sum_{j}^{v}|\left(F_{N}\left(\boldsymbol{x}\mid\boldsymbol{P}_{j}\right)-F_{N}\left(\boldsymbol{\tilde{x}}\mid\boldsymbol{P}_{j}\right)\right)|,

(6)

where $F_{N}:\mathbb{R}^{n\times 3}\rightarrow\mathbb{R}^{w\times h\times 3}$ represents the process of computing and projecting surface normals, as implemented in deferred shading [29].
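As a small illustration of the quantity compared in Eq. 6, the sketch below computes unit face normals of two triangles by a cross product and measures their $L_{1}$ disparity. It works per face rather than per pixel, and the helper names are hypothetical; in the rendered setting the same difference is accumulated over the pixels of the normal images.

```python
import math


def face_normal(a, b, c):
    """Unit normal of the triangle (a, b, c), oriented by the right-hand rule."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    n = [
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    ]
    length = math.sqrt(sum(k * k for k in n))  # zero for degenerate triangles
    return [k / length for k in n]


def normal_l1(n1, n2):
    """L1 disparity between two normals."""
    return sum(abs(p - q) for p, q in zip(n1, n2))


flat = face_normal((0, 0, 0), (1, 0, 0), (0, 1, 0))
tilted = face_normal((0, 0, 0), (1, 0, 0), (0, 1, 1))
disparity = normal_l1(flat, tilted)
print(flat, tilted, disparity)
```

Because normals depend on the orientation of each face rather than on its color, this term still provides a signal in texture-less regions.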

The holistic rendering alignment benefits from differentiable rendering to obtain derivatives, replacing the explicitly computed correspondences from previous approaches [41, 10]. By incorporating multiple cues, holistic rendering alignment ensures joint alignment of geometry and photometric appearances.

4.2 Multiscale Regularized Optimization

The HRA module guides deformation towards the joint alignment, but its derivative vectors are noisy due to shading and imaging variations. The lack of regularization of the derivatives could cause topological errors if directly applied as update steps to each vertex [30, 19]. To address these issues, a common approach is to add a smoothness term [41, 10], such as the ARAP energy [39] or the Laplacian differential representation [40]. However, these solutions introduce problems in tuning the regularization weight for outputs with both smooth and non-smooth regions [30], and implementing a robust solution scheme for non-linear optimization [38]. To overcome these challenges, we propose a multiscale regularized mechanism for generating high-quality aligned meshes, comprising two elements:

Vertex Optimization. We follow the work of Nicolet et al. [30] to update vertices iteratively. The authors suggest that second-order optimization such as Newton’s method is better for smoothing geometry, and that the computationally expensive Hessian matrix can be replaced by a reparameterization of $\boldsymbol{x}$ with introduced variables $\boldsymbol{\mu}$ to ensure the smoothness of the recovered $\boldsymbol{x}$:

\boldsymbol{x}=\left(\boldsymbol{E}+\lambda\boldsymbol{L}\right)^{-1}\boldsymbol{\mu},

(7)

\boldsymbol{\mu}\leftarrow\boldsymbol{\mu}-\eta\frac{\partial\boldsymbol{x}}{\partial\boldsymbol{\mu}}\frac{\partial\mathcal{L}}{\partial\boldsymbol{x}},

(8)

where $\eta>0$ is the learning rate, $\boldsymbol{E}\in\mathbb{R}^{n\times n}$ denotes the identity matrix, and $\lambda>0$ is the regularization weight. $\boldsymbol{L}\in\mathbb{R}^{n\times n}$ is a discrete Laplacian operator defined on a mesh $\mathcal{M}=\left(\mathcal{X},\mathcal{E}\right)$ with $n$ vertices $\mathcal{X}$ and $m$ edges $\mathcal{E}$:

\boldsymbol{L}_{ij}=\begin{cases}-w_{ij},&\mathrm{if}\quad(i,j)\in\mathcal{E}\\ \sum_{(i,k)\in\mathcal{E}}w_{ik},&\mathrm{if}\quad i=j\\ 0,&\mathrm{otherwise},\end{cases}

(9)

where $w_{ij}$ is the cotangent weight described in [11].
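To make Eq. 9 concrete, the following sketch assembles a dense Laplacian for a small triangle mesh. It assumes the common cotangent convention $w_{ij}=\frac{1}{2}(\cot\alpha_{ij}+\cot\beta_{ij})$, where $\alpha_{ij}$ and $\beta_{ij}$ are the angles opposite edge $(i,j)$; the exact normalization in [11] may differ, and the function name and the toy mesh are illustrative. A practical implementation would use a sparse matrix.

```python
import math


def cotangent_laplacian(vertices, faces):
    """Dense Laplacian of Eq. 9 with w_ij = (cot(alpha) + cot(beta)) / 2."""
    n = len(vertices)
    L = [[0.0] * n for _ in range(n)]
    for face in faces:
        for k in range(3):
            i, j, o = face[k], face[(k + 1) % 3], face[(k + 2) % 3]
            # The angle at vertex o is opposite the edge (i, j)
            u = [vertices[i][d] - vertices[o][d] for d in range(3)]
            v = [vertices[j][d] - vertices[o][d] for d in range(3)]
            dot = sum(a * b for a, b in zip(u, v))
            cross = [
                u[1] * v[2] - u[2] * v[1],
                u[2] * v[0] - u[0] * v[2],
                u[0] * v[1] - u[1] * v[0],
            ]
            # cot = cos / sin; fails for degenerate (zero-area) triangles
            cot = dot / math.sqrt(sum(c * c for c in cross))
            w = 0.5 * cot
            L[i][j] -= w
            L[j][i] -= w
            L[i][i] += w
            L[j][j] += w
    return L


# Unit square split into two right triangles along the diagonal (0, 2)
square_vertices = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
square_faces = [(0, 1, 2), (0, 2, 3)]
L_square = cotangent_laplacian(square_vertices, square_faces)
for row in L_square:
    print(row)
```

Each row sums to zero and the matrix is symmetric, which are the two properties the vertex update relies on.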

Combining Eq. 7 and Eq. 8, the update formula for vertices at each iteration is:

\boldsymbol{x}\leftarrow\boldsymbol{x}-\eta\left(\boldsymbol{E}+\lambda\boldsymbol{L}\right)^{-2}\frac{\partial\mathcal{L}}{\partial\boldsymbol{x}}.

(10)
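For completeness, the step from Eq. 7 and Eq. 8 to Eq. 10 can be written out as follows. The derivation assumes symmetric weights ($w_{ij}=w_{ji}$), so that $\boldsymbol{E}+\lambda\boldsymbol{L}$ and its inverse are symmetric; the primed symbols $\boldsymbol{\mu}'$ and $\boldsymbol{x}'$ denote the updated values and are notation introduced only for this sketch.

```latex
\begin{aligned}
\frac{\partial\boldsymbol{x}}{\partial\boldsymbol{\mu}}
  &= \left(\boldsymbol{E}+\lambda\boldsymbol{L}\right)^{-1}
  && \text{(from Eq. 7)},\\
\boldsymbol{\mu}'
  &= \boldsymbol{\mu}-\eta\left(\boldsymbol{E}+\lambda\boldsymbol{L}\right)^{-1}
     \frac{\partial\mathcal{L}}{\partial\boldsymbol{x}}
  && \text{(Eq. 8)},\\
\boldsymbol{x}'
  &= \left(\boldsymbol{E}+\lambda\boldsymbol{L}\right)^{-1}\boldsymbol{\mu}'
   = \boldsymbol{x}-\eta\left(\boldsymbol{E}+\lambda\boldsymbol{L}\right)^{-2}
     \frac{\partial\mathcal{L}}{\partial\boldsymbol{x}}
  && \text{(Eq. 10)}.
\end{aligned}
```

The last line uses $\left(\boldsymbol{E}+\lambda\boldsymbol{L}\right)^{-1}\boldsymbol{\mu}=\boldsymbol{x}$, so the raw gradient is filtered twice by the same smoothing operator before it is applied to the vertices.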

Multiscale Learning. The coarse-to-fine multiscale learning scheme, based on the tessellation technique [7], periodically decreases the average edge length of the triangle mesh while retaining the shape and the texture parameterization. The multiscale scheme allows for parameter adjustment at each scale to capture fine details without distorting the topology. In particular, the Laplacian matrix $\boldsymbol{L}$ is updated at each tessellation step as the topology changes. The template mesh $\boldsymbol{T}$, initially at the coarsest level, undergoes more rigid deformations with a higher regularization parameter $\lambda$ to fit the overall target expressions. As the mesh is tessellated to finer scales, $\lambda$ is decreased to capture more details.

The multiscale regularized optimization offers several advantages: (1) it produces high-quality meshes effectively with significantly reduced distortion and self-intersection artifacts; (2) it converges quickly without additional training data or priors other than the textured template.
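The smoothing behavior of the update in Eq. 10 can be sketched on a toy problem. The example below is an illustration under simplifying assumptions, not our implementation: it uses a 1D chain of vertices with uniform (not cotangent) weights, a dense Gaussian-elimination solver instead of a sparse factorization, and one scalar coordinate per vertex. All names and parameter values are hypothetical.

```python
def solve(A, b):
    """Solve A y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * y[c] for c in range(r + 1, n))
        y[r] = (M[r][n] - s) / M[r][r]
    return y


def path_laplacian(n):
    """Uniform-weight Laplacian of a chain of n vertices (an assumption)."""
    L = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        L[i][i] += 1.0
        L[i + 1][i + 1] += 1.0
        L[i][i + 1] -= 1.0
        L[i + 1][i] -= 1.0
    return L


def regularized_step(x, grad, L, lam, eta):
    """One update x <- x - eta * (E + lam * L)^(-2) * grad, as in Eq. 10."""
    n = len(x)
    A = [[(1.0 if i == j else 0.0) + lam * L[i][j] for j in range(n)]
         for i in range(n)]
    smoothed = solve(A, solve(A, grad))  # apply the inverse twice
    return [xi - eta * si for xi, si in zip(x, smoothed)], smoothed


n = 7
L = path_laplacian(n)
x = [0.0] * n
grad = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]  # a single noisy spike
x_new, step = regularized_step(x, grad, L, lam=2.0, eta=0.1)
print(step)
```

The spike is spread over neighboring vertices while its total is preserved, so a single noisy derivative no longer moves one vertex in isolation. A larger `lam` spreads the step further, which matches the use of a higher $\lambda$ at coarse scales.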

Table 1: Quantitative evaluations on 10 expressions per subject. Geometric metric (left columns): our method outperforms the deep learning method NPHM [16] and the ICP-based registration FaceScape TU [43] in geometric accuracy. Image metrics (right columns, GPJA only): the results are comparable to the photo-realistic results of the NeRF-style pipeline NeP [26].

	Geometric Metric (mm)↓			Image Metric
	GPJA	NPHM	FaceScape TU	PSNR↑	SSIM↑	LPIPS↓
Subject 7	0.254	0.338	0.633	24.75	0.7810	0.06868
Subject 122	0.265	0.294	0.817	23.56	0.7443	0.07081
Subject 212	0.101	0.436	0.948	25.51	0.7699	0.09368
Subject 340	0.142	0.253	0.747	24.10	0.7378	0.06802
Subject 344	0.161	0.291	0.831	23.08	0.7569	0.06791
Subject 350	0.160	0.436	0.811	25.38	0.7320	0.07857
Overall	0.181	0.341	0.797	24.40	0.7538	0.07461

4.3 Textured Template Mesh Creation

To adapt to the proposed pipeline, we construct a reliable textured template mesh which is used throughout the joint alignment for different expressions per subject. As illustrated in Fig. 4, we select the mouth-open expression of each subject and reconstruct it into the template mesh, because the mouth socket is crucial for accommodating flexible movements during deformation. Initially, we manually sculpt a coarse genus-zero mesh with a pre-designed texture parameterization, resembling a bust sculpture. Without loss of generality, this genus-zero surface is appropriate for representing the head geometry due to its similar topological structure.

The genus-zero bust mesh is used as initialization and processed through the GPJA pipeline depicted in Fig. 2. At this stage, only the depth constraint in HRA is involved. After the depth-guided reconstruction, a tessellated mesh is established. The vertex positions are then fixed, and the texture color map $\boldsymbol{C}_{\boldsymbol{T}}$ is updated through the color constraint via gradient descent. Finally, the tessellated mesh is decimated using seam-aware simplification [24] while preserving the texture parameterization, resulting in the textured template mesh $\boldsymbol{T}$.

5 Experiments

Experiment Setup. Our work is an early exploration of semantic annotation-free photometric alignment, and many existing public datasets (LYHM [9], NPHM [16], etc.), which primarily consist of 3D scans or registered meshes rather than original images, are unsuitable for GPJA. We also found FaMoS [6] inappropriate due to its sparse, down-sampled RGB views and subjects with noticeable facial markers. Following our investigation, the FaceScape dataset emerged as the most fitting benchmark, with high-resolution images from dense viewpoints and uniform lighting conditions for the subjects.

To thoroughly assess GPJA’s capability, we choose six subjects, including four publishable ones, and deliberately select 10 highly different expressions with large deformation and occlusion variations for each subject. Six to eight images covering frontal and side views are used as color reference images, and are downsampled to 2K resolution.

GPJA is implemented using PyTorch with Nvdiffrast [20] as the differentiable renderer. The multiscale scheme involves remeshing four times, increasing the vertex count from 16K to 250K. For the first two levels, the regularization parameter $\lambda$ is set to 200 and 120, while for the remaining levels, it is set to 80 and 50. Convergence is achieved within 1500 iterations per expression, taking approximately 15 minutes on a single RTX 3090 graphics card for all experiments.

Metrics. We evaluate the effectiveness of the method from both geometric and photometric aspects, since GPJA achieves joint alignment. The assessment of geometric alignment uses raw scans of the face region as ground truth, measuring the Euclidean distance, in millimeters, from each ground-truth vertex to its closest point on the aligned surface. For photometric alignment assessment, multi-view images captured by the setup are considered as ground truth, while rendered images are synthesized at the same camera poses using aligned meshes with the texture color map of the template. We utilize commonly used image metrics including PSNR, SSIM, and LPIPS for evaluation.
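Of the three image metrics, PSNR has a simple closed form, $10\log_{10}(\mathrm{MAX}^{2}/\mathrm{MSE})$, which the sketch below evaluates on toy single-channel images with values in $[0,1]$. The function name and data are illustrative, and the exact evaluation protocol (for example color space or cropping) is not specified here; SSIM and LPIPS require windowed statistics and a pretrained network respectively and are omitted.

```python
import math


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    sq_err = 0.0
    count = 0
    for row_r, row_ref in zip(rendered, reference):
        for p, q in zip(row_r, row_ref):
            sq_err += (p - q) ** 2
            count += 1
    mse = sq_err / count
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


rendered_img = [[0.5, 0.5], [0.5, 0.5]]
reference_img = [[0.4, 0.6], [0.5, 0.5]]
value = psnr(rendered_img, reference_img)
print(round(value, 2))
```

Higher values indicate a closer match between the rendering and the captured image.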

Result Analysis on Geometric Alignment. We compare GPJA against two registration methods. The first is the FaceScape topologically uniform (TU) meshes, obtained through a variant of ICP [43], which is the standard surface registration method in the 3D face domain [19]. The second is the state-of-the-art deep learning based method NPHM [16], which is trained on a dataset comprising 87 subjects and 23 different facial expressions.

Quantitative and qualitative comparisons of geometric alignment are presented in the left of Tab. 1 and in Fig. 5. The statistical analysis demonstrates the overall superior performance of our method. As illustrated in Fig. 5, our method achieves higher fidelity, even in challenging areas such as the lips and eyes, and captures finer details of wrinkles on the jaws and cheeks. The primary limitation of NPHM lies in its lack of fidelity in capturing details. This deficiency is particularly evident in errors concentrated around the mouth of subject 212 and the wrinkles of subject 344. Figure 5 also identifies two main drawbacks of the FaceScape TU meshes. Firstly, in subject 122’s lip-funneler expression, the eyes should be closed but appear open. This is a common issue resulting from inaccurate landmarks in the ICP-based method. Secondly, the face rim areas (forehead and cheeks) in the TU meshes exhibit higher geometric errors due to excessive conformation to the template shape, which is related to the difficulty of selecting suitable global parameters.

Table 2: Ablation studies demonstrating the effect of each constraint in HRA. The geometric performance degrades when any one of the three constraints is removed, validating the contribution of each constraint in GPJA.

	Geometric Metric (mm)↓	PSNR↑	SSIM↑	LPIPS↓
GPJA	0.167	24.06	0.7523	0.07511
Ablation
w/o $\mathcal{L}_{normal}$	0.377	24.49	0.7498	0.07578
w/o $\mathcal{L}_{depth}$	0.455	23.67	0.7535	0.07544
w/o $\mathcal{L}_{color}$	0.447	22.04	0.7364	0.07820

Result Analysis on Photometric Alignment. Previous registration methods tend to neglect photometric information, resulting in a failure to consistently preserve texture parameterization. The FaceScape TU meshes are accompanied by texture maps for each expression. Nevertheless, inconsistencies in the texture maps of the same subject across various expressions are commonly observed (Fig. 5(a)), illustrating the failure of TU meshes in maintaining photometric alignment. Deep learning methods trained on aligned meshes processed by ICP-based techniques also struggle to address this issue. Notably, Fig. 5(b) reveals another common drawback caused by overlooking photometric features: NPHM outputs false lips that should be invisible. While this issue is obvious in photometric appearance, it is not apparent in geometric evaluations.

In contrast, GPJA ensures photometric consistency by applying the shared color texture to fit reference images of diverse expressions. Comparison between the ground truth and our rendered images in Fig. 7 demonstrates the photometric alignment achieved through our registration. Notably, the rendered images exhibit pixel-level overlapping consistency in characteristic areas such as the mouth, eyes, and mole features, across various facial expressions despite occlusion changes. The last row of Fig. 7 illustrates that the discrepancy between the ground truth and the rendered images is primarily attributed to variations in skin tone across facial expressions. Additionally, the right of Tab. 1 provides the image metrics for our experiment. The quantitative results are superior to the PSNR of 23.61, SSIM of 0.6460, and LPIPS of 0.09677 achieved by the NeRF-style pipeline NeP [26] (tested on the first 100 subjects of FaceScape), which produces photo-realistic reconstruction results through per-frame color generation.

Ablations. The HRA based on multiple cues plays a crucial role in correctly warping the template into joint alignment. To validate HRA’s effectiveness, we conduct two ablation experiments to confirm its benefits utilizing four publishable subjects of FaceScape.

We first validate the contribution of each constraint in the HRA mechanism, as reported in Tab. 2. When all constraints are utilized, the geometric error is minimized. As a visualized example, Fig. 7(a) reveals that the normal term significantly contributes to the sharpness of details and alleviates incorrect bumps on texture-less regions like cheeks. However, Tab. 2 shows close image metrics in comparison. We speculate the reason is that with the color constraint guiding the deformation, the geometric distortion cannot manifest itself in color renderings.

In the second ablation, we study the effectiveness of the masking mechanism for the color constraint. We synthesize the label image for the mouth socket using $F_{S}(\boldsymbol{x}|\boldsymbol{P}_{j},\boldsymbol{B})$, and overlay it on the ground truth to examine the contours around the inner mouth. Figure 7(b) illustrates that when the inner mouth is not masked out, the vertices around it are disturbed and become distorted. Hence, intentionally masking out the inner mouth facilitates correct tracing of the mouth contours.

6 Conclusion

We present a geometric-photometric joint alignment approach for facial mesh registration based on differentiable rendering, demonstrating robust performance under various facial expressions with occlusion changes. Our method addresses this challenging task by designing a holistic rendering alignment with multiple cues, optimized using a multiscale regularized algorithm. Unlike previous methods, our semantic annotation-free approach does not require marker point tracking or a set of aligned meshes for training. It is fully automatic and efficiently executed on a consumer GPU with fast convergence. Experimental results show that our aligned meshes achieve high geometric accuracy, surpassing conventional ICP-based techniques and the state-of-the-art method NPHM. We also validate the photometric alignment by comparing rendered images with captured multi-view images, demonstrating pixel-level alignment in key facial areas, including the eyes, mouth, nostrils, and even freckles.

Our method has some limitations. Firstly, we observed that certain moles and freckles are too small to impose significant constraints, leading to the deformation ceasing before achieving pixel-level alignment. Secondly, while the mouth-open template can be effectively registered to various expressions, there are instances where the presence of teeth and tongues occasionally misguides the mouth contours.

While our method currently focuses on static facial scans with a single mouth-open template, future extensions can consider multi-view video sequences. Incorporating temporal information can address skin tone changes during performance and improve deformation accuracy. Additionally, refining the rendering function to include more complex effects, and exploring advanced inverse rendering pipelines for simultaneously reconstructing material properties, are potential avenues for future research.

References

[1] Brian Amberg, Sami Romdhani, and Thomas Vetter. Optimal step nonrigid ICP algorithms for surface registration. In 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2007), 18-23 June 2007, Minneapolis, Minnesota, USA. IEEE Computer Society, 2007.
[2] ShahRukh Athar, Zexiang Xu, Kalyan Sunkavalli, Eli Shechtman, and Zhixin Shu. Rignerf: Fully controllable neural 3d portraits. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 20332–20341. IEEE, 2022.
[3] Alex Baden, Keenan Crane, and Misha Kazhdan. Möbius registration. Comput. Graph. Forum, 37(5):211–220, 2018.
[4] Thabo Beeler, Fabian Hahn, Derek Bradley, Bernd Bickel, Paul A. Beardsley, Craig Gotsman, Robert W. Sumner, and Markus H. Gross. High-quality passive facial performance capture using anchor frames. ACM Trans. Graph., 30(4):75, 2011.
[5] Amit Bermano, Thabo Beeler, Yeara Kozlov, Derek Bradley, Bernd Bickel, and Markus H. Gross. Detailed spatio-temporal reconstruction of eyelids. ACM Trans. Graph., 34(4):44:1–44:11, 2015.
[6] Timo Bolkart, Tianye Li, and Michael J. Black. Instant multi-view head capture through learnable registration. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 768–779. IEEE, 2023.
[7] Mario Botsch and Leif Kobbelt. A remeshing approach to multiresolution modeling. In Second Eurographics Symposium on Geometry Processing, Nice, France, July 8-10, 2004, pages 185–192. Eurographics Association, 2004.
[8] Chen Cao, Derek Bradley, Kun Zhou, and Thabo Beeler. Real-time high-fidelity facial performance capture. ACM Trans. Graph., 34(4):46:1–46:9, 2015.
[9] Hang Dai, Nick E. Pears, William A. P. Smith, and Christian Duncan. Statistical modeling of craniofacial shape and texture. Int. J. Comput. Vis., 128(2):547–571, 2020.
[10] Bailin Deng, Yuxin Yao, Roberto M. Dyke, and Juyong Zhang. A survey of non-rigid 3d registration. Comput. Graph. Forum, 41(2):559–589, 2022.
[11] Mathieu Desbrun, Mark Meyer, Peter Schröder, and Alan H. Barr. Implicit fairing of irregular meshes using diffusion and curvature flow. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 1999, Los Angeles, CA, USA, August 8-13, 1999, pages 317–324. ACM, 1999.
[12] Dimitar Dinev, Thabo Beeler, Derek Bradley, Moritz Bächer, Hongyi Xu, and Ladislav Kavan. User-guided lip correction for facial performance capture. Comput. Graph. Forum, 37(8):93–101, 2018.
Fyffe et al. [2017] Graham Fyffe, Koki Nagano, Loc Huynh, Shunsuke Saito, Jay Busch, Andrew Jones, Hao Li, and Paul E. Debevec. Multi-view stereo on consistent face topology. Comput. Graph. Forum, 36(2):295–309, 2017.
Gafni et al. [2021] Guy Gafni, Justus Thies, Michael Zollhöfer, and Matthias Nießner. Dynamic neural radiance fields for monocular 4d facial avatar reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 8649–8658. Computer Vision Foundation / IEEE, 2021.
Garrido et al. [2016] Pablo Garrido, Michael Zollhöfer, Chenglei Wu, Derek Bradley, Patrick Pérez, Thabo Beeler, and Christian Theobalt. Corrective 3d reconstruction of lips from monocular video. ACM Trans. Graph., 35(6):219:1–219:11, 2016.
Giebenhain et al. [2023] Simon Giebenhain, Tobias Kirschstein, Markos Georgopoulos, Martin Rünz, Lourdes Agapito, and Matthias Nießner. Learning neural parametric head models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 21003–21012. IEEE, 2023.
Gotardo et al. [2018] Paulo F. U. Gotardo, Jérémy Riviere, Derek Bradley, Abhijeet Ghosh, and Thabo Beeler. Practical dynamic facial appearance modeling and acquisition. ACM Trans. Graph., 37(6):232, 2018.
Hartley and Zisserman [2006] Richard Hartley and Andrew Zisserman. 6.2.3 depth of points. In Multiple view geometry in computer vision (2. ed.). Cambridge University Press, 2006.
Jung et al. [2023] Yucheol Jung, Hyomin Kim, Gyeongha Hwang, Seung-Hwan Baek, and Seungyong Lee. Mesh density adaptation for template-based shape reconstruction. In ACM SIGGRAPH 2023 Conference Proceedings, SIGGRAPH 2023, Los Angeles, CA, USA, August 6-10, 2023, pages 53:1–53:10. ACM, 2023.
Laine et al. [2020] Samuli Laine, Janne Hellsten, Tero Karras, Yeongho Seol, Jaakko Lehtinen, and Timo Aila. Modular primitives for high-performance differentiable rendering. ACM Trans. Graph., 39(6):194:1–194:14, 2020.
Lee and Kazhdan [2019] Sing Chun Lee and Misha Kazhdan. Dense point-to-point correspondences between genus-zero shapes. Comput. Graph. Forum, 38(5):27–37, 2019.
Li et al. [2008] Hao Li, Robert W. Sumner, and Mark Pauly. Global correspondence optimization for non-rigid registration of depth scans. Comput. Graph. Forum, 27(5):1421–1430, 2008.
Li et al. [2021] Tianye Li, Shichen Liu, Timo Bolkart, Jiayi Liu, Hao Li, and Yajie Zhao. Topologically consistent multi-view face inference using volumetric sampling. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, pages 3804–3814. IEEE, 2021.
Liu et al. [2017] Songrun Liu, Zachary Ferguson, Alec Jacobson, and Yotam I. Gingold. Seamless: seam erasure and seam-aware decoupling of shape from mesh resolution. ACM Trans. Graph., 36(6):216:1–216:15, 2017.
Liu et al. [2022] Shichen Liu, Yunxuan Cai, Haiwei Chen, Yichao Zhou, and Yajie Zhao. Rapid face asset acquisition with recurrent feature alignment. ACM Trans. Graph., 41(6):214:1–214:17, 2022.
Ma et al. [2022] Li Ma, Xiaoyu Li, Jing Liao, Xuan Wang, Qi Zhang, Jue Wang, and Pedro V. Sander. Neural parameterization for dynamic human head editing. ACM Trans. Graph., 41(6):236:1–236:15, 2022.
Mildenhall et al. [2020] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part I, pages 405–421. Springer, 2020.
Munkberg et al. [2022] Jacob Munkberg, Wenzheng Chen, Jon Hasselgren, Alex Evans, Tianchang Shen, Thomas Müller, Jun Gao, and Sanja Fidler. Extracting triangular 3d models, materials, and lighting from images. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 8270–8280. IEEE, 2022.
Nguyen [2007] Hubert Nguyen. Chapter 22. baking normal maps on the GPU. In GPU Gems 3. Addison-Wesley Professional, 2007.
Nicolet et al. [2021] Baptiste Nicolet, Alec Jacobson, and Wenzel Jakob. Large steps in inverse rendering of geometry. ACM Trans. Graph., 40(6):248:1–248:13, 2021.
Nimier-David et al. [2019] Merlin Nimier-David, Delio Vicini, Tizian Zeltner, and Wenzel Jakob. Mitsuba 2: a retargetable forward and inverse renderer. ACM Trans. Graph., 38(6):203:1–203:17, 2019.
Pears et al. [2023] Nick E. Pears, Hang Dai, William A. P. Smith, and Hao Sun. Laplacian ICP for progressive registration of 3d human head meshes. In 17th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2023, Waikoloa Beach, HI, USA, January 5-8, 2023, pages 1–7. IEEE, 2023.
Prada et al. [2016] Fabian Prada, Misha Kazhdan, Ming Chuang, Alvaro Collet, and Hugues Hoppe. Motion graphs for unstructured textured meshes. ACM Trans. Graph., 35(4):108:1–108:14, 2016.
Pumarola et al. [2021] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 10318–10327. Computer Vision Foundation / IEEE, 2021.
Ramamoorthi and Hanrahan [2001] Ravi Ramamoorthi and Pat Hanrahan. An efficient representation for irradiance environment maps. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 2001, Los Angeles, California, USA, August 12-17, 2001, pages 497–500. ACM, 2001.
Riviere et al. [2020] Jérémy Riviere, Paulo F. U. Gotardo, Derek Bradley, Abhijeet Ghosh, and Thabo Beeler. Single-shot high-quality facial geometry and skin appearance capture. ACM Trans. Graph., 39(4):81, 2020.
Rusinkiewicz and Levoy [2001] Szymon Rusinkiewicz and Marc Levoy. Efficient variants of the ICP algorithm. In 3rd International Conference on 3D Digital Imaging and Modeling (3DIM 2001), 28 May - 1 June 2001, Quebec City, Canada, pages 145–152. IEEE Computer Society, 2001.
Schmidt et al. [2022] Patrick Schmidt, Janis Born, David Bommes, Marcel Campen, and Leif Kobbelt. Tinyad: Automatic differentiation in geometry processing made simple. Comput. Graph. Forum, 41(5):113–124, 2022.
Sorkine and Alexa [2007] Olga Sorkine and Marc Alexa. As-rigid-as-possible surface modeling. In Proceedings of the Fifth Eurographics Symposium on Geometry Processing, Barcelona, Spain, July 4-6, 2007, pages 109–116. Eurographics Association, 2007.
Sorkine et al. [2004] Olga Sorkine, Daniel Cohen-Or, Yaron Lipman, Marc Alexa, Christian Rössl, and Hans-Peter Seidel. Laplacian surface editing. In Second Eurographics Symposium on Geometry Processing, Nice, France, July 8-10, 2004, pages 175–184. Eurographics Association, 2004.
Tam et al. [2013] Gary K. L. Tam, Zhi-Quan Cheng, Yu-Kun Lai, Frank C. Langbein, Yonghuai Liu, A. David Marshall, Ralph R. Martin, Xianfang Sun, and Paul L. Rosin. Registration of 3d point clouds and meshes: A survey from rigid to nonrigid. IEEE Trans. Vis. Comput. Graph., 19(7):1199–1217, 2013.
Wu et al. [2016] Chenglei Wu, Derek Bradley, Markus H. Gross, and Thabo Beeler. An anatomically-constrained local deformation model for monocular face capture. ACM Trans. Graph., 35(4):115:1–115:12, 2016.
Yang et al. [2020] Haotian Yang, Hao Zhu, Yanru Wang, Mingkai Huang, Qiu Shen, Ruigang Yang, and Xun Cao. Facescape: A large-scale high quality 3d face dataset and detailed riggable 3d face prediction. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 598–607. Computer Vision Foundation / IEEE, 2020.
Zhang et al. [2022] Longwen Zhang, Chuxiao Zeng, Qixuan Zhang, Hongyang Lin, Ruixiang Cao, Wei Yang, Lan Xu, and Jingyi Yu. Video-driven neural physically-based facial asset for production. ACM Trans. Graph., 41(6):208:1–208:16, 2022.
Zheng et al. [2022] Yufeng Zheng, Victoria Fernández Abrevaya, Marcel C. Bühler, Xu Chen, Michael J. Black, and Otmar Hilliges. I M avatar: Implicit morphable head avatars from videos. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 13535–13545. IEEE, 2022.