
¹ Zhejiang University ² vivo AI Lab

FAGhead: Fully Animate Gaussian Head from Monocular Videos

Yixin Xuan¹, Xinyang Li¹, Gongxin Yao¹, Shiwei Zhou², Donghui Sun², Xiaoxin Chen², Yu Pan (corresponding author; email: [email protected])

Abstract

High-fidelity reconstruction of 3D human avatars has wide applications in virtual reality. In this paper, we introduce FAGhead, a method that enables fully controllable human portraits from monocular videos. We explicitly build on the traditional 3D morphable model (3DMM) and optimize neutral 3D Gaussians to reconstruct complex expressions. Furthermore, we employ a novel Point-based Learnable Representation Field (PLRF) with learnable Gaussian point positions to enhance reconstruction performance. Meanwhile, to effectively manage the edges of avatars, we introduce alpha rendering to supervise the alpha value of each pixel. Extensive experimental results on open-source datasets and our own captured datasets demonstrate that our approach generates high-fidelity 3D head avatars with full control over the expression and pose of the virtual avatars, outperforming existing works.

Keywords: 3D Face Reconstruction · Facial Animation · Facial Expression Synthesis

Figure 1: Given a monocular video, our proposed FAGhead approach generates high-fidelity avatars and the corresponding alpha maps. By leveraging the novel Point-based Learnable Representation Field, FAGhead ensures photorealistic reanimation and generalizes to novel expressions and head poses.

1 Introduction

Reconstructing 3D head avatars from monocular video has attracted rapidly growing attention in recent decades, driven by applications such as 3D content creation [5], virtual reality (VR) [48], and gaming [54], yet it remains a challenging problem in computer vision. With the development of digital humans, the demand for automated photo-realistic avatar synthesis keeps growing.

Previous works mainly exploit 3D morphable model (3DMM) representations [2, 33], focusing on shape and expression transformations [25, 59, 10] to match the original avatars. However, in monocular settings, these methods fail to meet photorealistic requirements and require accurate geometry meshes as priors, which limits their applicability.

Advances in geometry reconstruction have significantly improved the accuracy of geometry synthesis. Neural Radiance Field (NeRF) [29] is the most representative work in this field, showing strong capability on complex objects and leading to higher-quality results. Some methods [30, 31] produce photo-realistic human avatars by optimizing an additional continuous volumetric deformation field, while other methods [1, 12, 24] combine NeRF with the traditional 3DMM approach and can generalize to novel deformations [14]. However, volume rendering, which relies on extensive sampling and alpha compositing, constrains inference speed.

More recently, 3D Gaussian Splatting (3DGS) [20] represents a real-world 3D scene with a set of 3D Gaussian points, each carrying learnable properties, and demonstrates photo-realistic novel view synthesis at high efficiency. For animated avatar synthesis, mainstream approaches [51, 47, 8, 36] build a deformation field from canonical space to deformed space using a Multi-Layer Perceptron (MLP). Although these approaches have substantially advanced photo-realistic avatar synthesis, they are unable to decouple identity and expression information effectively, which leads to unreasonable results in animation tasks with novel expressions.

To overcome this issue and further improve animation quality, we propose FAGhead, a novel method based on the 3DMM representation for high-fidelity avatar reconstruction and animation. Inspired by previous works [23, 34], which explicitly drive the FLAME [25] model via linear blend skinning (LBS) in a multi-view camera setting, we extend this approach to the monocular setting. Regarding decoupling, we separate identity and expression information during preprocessing through a modified face tracker [59].

Regarding Gaussian initialization, we propose a novel Point-based Learnable Representation Field (PLRF) approach that positions Gaussian points along the midline of a single triangle face, thereby increasing the density of Gaussian points and enhancing facial expression details. Specifically, instead of initializing the 3D Gaussian points at the center of each triangle face of the FLAME mesh, we sample Gaussian points with learnable positions along the line segments connecting the centroid to each vertex of the triangle face within each avatar mesh.

Moreover, a transform network is built to map dynamic facial movements from the canonical point-based field to the transformed point-based field. In practice, it takes pre-retrieved FLAME parameters as conditions to produce the facial movement deformation. Besides, to improve rendering at edges such as hair and shoulders, we introduce an alpha loss between the pseudo ground-truth alpha map and the rendered alpha map. With these enhancements, FAGhead achieves higher-fidelity rendering and provides fully controllable avatars over facial expressions and head poses. In summary, our contributions are as follows:

$\bullet$ We propose Fully Animate Gaussian Head (FAGhead), a novel head avatar synthesis approach with an effective representation field, which provides fully controllable head avatars and achieves high-fidelity face reenactment.

$\bullet$ We propose a Point-based Learnable Representation Field (PLRF), which significantly enhances generation quality and geometric detail.

$\bullet$ We introduce a transform network to fit the deformation defined by the 3DMM.

$\bullet$ We redesign the model structure and propose additional loss terms, namely an alpha loss and further regularization losses, to enhance the performance of geometry reconstruction.

2 Related Work

2.1 Scene Reconstruction and Novel View Synthesis

Early novel view synthesis mainly focused on image-based rendering (IBR) [39, 55], which synthesizes novel views by interpolating between nearest-neighbour images. More recently, significant advances have been made in Structure from Motion (SfM) [38] and Multi-view Stereo (MVS) [44], both of which build explicit 3D scene representations from RGB images [3, 40, 15]. In particular, COLMAP [38] plays a significant role by providing reliable 3D feature points matched across multi-view images. With the development of neural rendering [42], more effective and realistic approaches have been proposed, most notably Neural Radiance Fields (NeRF) [29]. NeRF achieves highly realistic reconstruction by using an MLP as the 3D scene representation together with volume rendering.

To improve the efficiency of NeRF, Point-NeRF [50] uses a neural point cloud initialized via a deep network and reduces training time through a novel point cloud pruning and growing mechanism. NSVF [26] represents the scene with a sparse voxel octree, which avoids wasted computation. Meanwhile, a variety of advanced NeRF variants employing other explicit representations [11, 53, 41, 6, 17] have been proposed to overcome the slow-speed issue. However, there is still considerable room for improvement, as the rendering speed cannot meet real-time requirements. 3D Gaussian Splatting [20] provides a faster and more efficient method for reconstructing 3D scenes, offering high quality with a simple and effective pipeline.

2.2 3D Parameter Head Model

Building on the premise that the human head shape space can be effectively separated into identity, expression, and appearance components, the 3D Morphable Model (3DMM) [2] was proposed to embed 3D head shapes into multiple low-dimensional principal component analysis (PCA) spaces. Subsequent research [4, 46, 37, 16] has expanded this mesh-based parametric head model, enhancing its representational capacity through the development of multi-linear [4] and non-linear models [45], as well as articulated models equipped with corrective blendshapes.

Recent cutting-edge techniques have advanced in accurately capturing the intricate deformations associated with facial expressions by incorporating additional displacement maps that respond to the input images [9, 10]. In addition, generative models powered by machine learning, such as GANs [18] and StyleGAN [19], have been integrated into current frameworks [13, 28] to refine the precision in modeling facial textures and geometries. However, despite these advances, the parametric models are still mostly limited to capturing the facial region’s geometry and appearance at a rather basic level through explicit mesh models. This limitation detracts from the photorealistic quality of reconstructions and animations based on these models [14].

2.3 3D Head Portrait Synthesis

3D head portrait synthesis can be divided into explicit and implicit representations. Explicit representations, primarily based on mesh models [4], have been evolving for many years. In recent efforts, some researchers [21, 43] have used 2D neural rendering techniques to create photo-realistic portraits, although these methods often overlook non-facial areas or suffer from temporal and spatial inconsistencies due to their weak integration with 3D geometry. Other approaches [10, 14] have focused on learning vertex offsets to capture the detailed head geometry more accurately, but they can still encounter geometry and texture artifacts in complex areas like hair, eyes, and mouth, limited by the mesh model’s representational capacity and the challenges of differentiable rendering. PointAvatar [58] introduced a novel, deformable point-based approach, overcoming some mesh model limitations, albeit at the cost of requiring an excessive number of points and extensive training periods. Implicit models, on the other hand, use neural functions to create digital head avatars, with significant research [57] dedicated to achieving high fidelity, though often at the expense of training and inference efficiency. Innovations such as volumetric primitives [27] and local feature grids [12, 52, 60] have been proposed to enhance efficiency and reduce the computational load. Moreover, FlashAvatar [49] proposed a UV sampling strategy to enhance rendering efficiency.

3 Preliminary

3D Gaussian Splatting [20] differs substantially from the widely adopted Neural Radiance Field. 3DGS uses explicit 3D Gaussian points, initialized from the feature points generated by COLMAP [38], as the fundamental rendering entities. Consider the set of rendering entities $\{G_{t}\}^{N}_{t=0}$ , where $t$ is the index of a 3D Gaussian point and $N$ is the number of 3D Gaussian points. Each point has the following properties: $\zeta_{t}=\{\mu_{t},o_{t},r_{t},s_{t},c_{t}\}$ . Here $\mu_{t}\in\mathbb{R}^{3}$ is the position of the Gaussian point, $o_{t}\in\mathbb{R}$ describes the opacity, $r_{t}\in\mathbb{R}^{4}$ indicates the rotation, $s_{t}\in\mathbb{R}^{3}$ reflects the scale, and $c_{t}\in\mathbb{R}^{3}$ denotes the view-dependent color, calculated via spherical harmonic coefficients whose number $k$ depends on the degree of the spherical harmonics. Each 3D Gaussian point is mathematically defined by a spatial mean $\mu$ and a covariance matrix $\Sigma$ as follows:

G_{t}(x)=\exp({-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)}),

(1)

During rendering, all 3D Gaussian points are first projected onto the specified 2D camera plane. Following prior work [61], given a viewing transformation $W$ , the projected covariance matrix $\Sigma^{\prime}$ in camera coordinates can be computed as follows:

\Sigma^{\prime}=JW\Sigma W^{T}J^{T},

(2)

where $J$ denotes the Jacobian of the affine approximation of perspective projection transformation.
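As a minimal sketch of Eq. 2, the projection can be written out for a single Gaussian. The sketch assumes a pinhole camera with focal lengths `fx` and `fy` and uses only the rotation part of $W$; the helper names are illustrative and are not taken from the 3DGS or FAGhead code.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def project_covariance(cov3d, W, t_cam, fx, fy):
    """Sigma' = J W Sigma W^T J^T for one Gaussian.

    cov3d: 3x3 world-space covariance, W: 3x3 rotation part of the viewing
    transformation, t_cam: Gaussian centre in camera coordinates,
    fx, fy: focal lengths in pixels (assumed pinhole camera).
    """
    tx, ty, tz = t_cam
    # Jacobian of the projection (fx * x / z, fy * y / z) evaluated at t_cam.
    J = [[fx / tz, 0.0, -fx * tx / tz ** 2],
         [0.0, fy / tz, -fy * ty / tz ** 2]]
    JW = matmul(J, W)
    return matmul(matmul(JW, cov3d), transpose(JW))


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cov3d = [[0.01, 0.0, 0.0], [0.0, 0.01, 0.0], [0.0, 0.0, 0.01]]
# An isotropic Gaussian on the optical axis, two units in front of the camera.
cov2d = project_covariance(cov3d, identity, (0.0, 0.0, 2.0), 100.0, 100.0)
```

In this example the Jacobian reduces to a uniform scaling by `fx / tz = 50`, so the 2x2 result stays isotropic.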

Thus, the color $\hat{C}$ of a given pixel can be synthesized as:

\hat{C}(x)=\sum_{t\in M}c_{t}\alpha_{t}(x)\prod_{j=1}^{t-1}(1-\alpha_{j}(x))

(3)

with

\alpha_{t}(x)=o_{t}\exp({-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)})

(4)
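Eqs. 3 and 4 can be illustrated for a single pixel as follows. This is a sketch that assumes the Gaussians are already sorted from near to far and that the 2D projected mean and inverse covariance are used when evaluating $\alpha_{t}(x)$; the function names are hypothetical.

```python
import math


def gaussian_alpha(opacity, offset, cov2d_inv):
    """alpha_t(x) = o_t * exp(-0.5 * d^T Sigma^{-1} d), with d = x - mu (Eq. 4)."""
    dx, dy = offset
    q = (dx * (cov2d_inv[0][0] * dx + cov2d_inv[0][1] * dy)
         + dy * (cov2d_inv[1][0] * dx + cov2d_inv[1][1] * dy))
    return opacity * math.exp(-0.5 * q)


def composite_color(colors, alphas):
    """Front-to-back blending of Eq. 3; inputs are sorted near to far."""
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # running product of (1 - alpha_j) for j < t
    for color, alpha in zip(colors, alphas):
        weight = alpha * transmittance
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel


unit_inv = [[1.0, 0.0], [0.0, 1.0]]
a_near = gaussian_alpha(0.5, (0.0, 0.0), unit_inv)  # pixel at the Gaussian centre
a_far = gaussian_alpha(0.5, (0.0, 0.0), unit_inv)
# A red Gaussian in front of a blue one, both with opacity 0.5.
pixel = composite_color([(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)], [a_near, a_far])
```

The nearer Gaussian contributes with weight 0.5 and the farther one with 0.5 × (1 − 0.5) = 0.25, which shows how occlusion enters through the running transmittance.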

To optimize shared memory usage, 3DGS uses a GPU-optimized rasterization process that assigns each thread block to an image tile. This approach not only allows for realistic scene reconstruction but also delivers considerably faster rendering and lower memory consumption during training than NeRF-based approaches.

4 Method

The proposed model is outlined in Fig. 2. Firstly, we preprocess the given monocular video into FLAME parameters as detailed in Sec. 4.1. During the initialization stage (Sec. 4.2), we employ the Point-based Learnable Representation Field aligned with the canonical mesh, where the position dynamically adjusts throughout training. Subsequently, we fit the deformation between the canonical space and the current frame utilizing the transform network (Sec. 4.3). To further enhance the geometry reconstruction performance, we introduce alpha rendering (Sec. 4.4).

4.1 Data Preprocessing

Given a monocular video $V$ consisting of images $I=\{I_{i}\}$ , our objective is to extract the camera parameters, including the intrinsic parameters $K_{i}$ and the camera poses $C_{i}$ , together with the FLAME meshes $M_{i}$ and the corresponding property set $\rho_{i}$ , which covers shape, expression, and the poses of the jaw, neck, and eyes. Because a monocular video provides only a single camera view, it inherently contains less information than the multi-view datasets used in NeRSemble [23]. This limitation necessitates more rigorous data preprocessing. In our experiments, failing to align the initialized parametric mesh with the ground-truth image leads to disorganized results.

To adapt effectively to the monocular video setting, we fix the neck pose during processing and solely optimize the camera pose relative to the head, diverging from the approach in GaussianAvatars [34]. Moreover, we convert screen coordinates to Normalized Device Coordinates (NDC) using PyTorch3D [35], ensuring greater compatibility with 3DGS. Our optimization focuses on the field of view (FOV) while keeping the properties $Z_{near}$ and $Z_{far}$ fixed. During preprocessing, we first optimize the identity information within the FLAME2020 framework and subsequently optimize expressions and other properties while keeping the identity information fixed, effectively decoupling identity from expression information. Further details are available in the supplementary materials.

4.2 Point-based Learnable Representation Field

Given the shape, pose, and expression components, a morphologically realistic mesh can be produced via the FLAME framework. The key question is how to effectively connect the FLAME mesh with 3D Gaussian Splatting. Motivated by GaussianAvatars [34], we build a Point-based Learnable Representation Field (PLRF) on the original meshes to augment the originally initialized Gaussian point field. As shown in Fig. 3, instead of initializing one Gaussian point at the center of each triangle face, we adopt a learnable strategy. Given the three vertices of a triangle $\{x_{1},x_{2},x_{3}\}$ , we first take their mean position $\bar{x}$ by:

\bar{x}=mean(x_{1},x_{2},x_{3})

(5)

From the centroid $\bar{x}$ , we obtain the line segments to the three vertices of the triangle, $\{\bar{x}x_{1},\bar{x}x_{2},\bar{x}x_{3}\}$ . We then place Gaussian points on these segments by:

\begin{aligned}x^{\prime}_{1}&=(1-n)\bar{x}+nx_{1}\\ x^{\prime}_{2}&=(1-n)\bar{x}+nx_{2}\\ x^{\prime}_{3}&=(1-n)\bar{x}+nx_{3}\end{aligned}

(6)

where $n$ is a learned parameter in the range $[0,1]$ , initialized to 0.5. Thus, the new positions of the Gaussian points are $\{x^{\prime}_{1},x^{\prime}_{2},x^{\prime}_{3},\bar{x}\}$ .
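The sampling in Eqs. 5 and 6 can be sketched for one triangle as below. This is a plain forward evaluation in which $n$ is passed as an ordinary number; how $n$ is kept inside $[0,1]$ during optimization is not specified here, and the function name is illustrative.

```python
def plrf_points(x1, x2, x3, n=0.5):
    """Return the four PLRF anchor positions {x1', x2', x3', centroid}."""
    # Eq. 5: centroid of the triangle.
    centroid = [(a + b + c) / 3.0 for a, b, c in zip(x1, x2, x3)]

    def toward(vertex):
        # Eq. 6: point on the segment from the centroid (n = 0) to the vertex (n = 1).
        return [(1.0 - n) * c + n * v for c, v in zip(centroid, vertex)]

    return [toward(x1), toward(x2), toward(x3), centroid]


points = plrf_points((0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 3.0, 0.0), n=0.5)
```

With `n = 0.5` each sampled point sits halfway between the centroid and its vertex; `n = 0` collapses all points onto the centroid and `n = 1` moves them onto the vertices.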

We assign four Gaussian points to each triangle face at initialization, which is equivalent to splitting a complete triangle face into four sub-faces. Additionally, we separately count the Gaussian points that are cloned during training. This approach, which we name PLRF, increases the density of Gaussian points and avoids holes in the rendered avatars.

After constructing the point-based field, we apply a global transform that maps the local Gaussian points on each triangle face to the global space:

r^{g}=\mathbf{R}r

(7)

\mu^{g}=\frac{1}{2}\mathbf{k}\mathbf{R}\mu+\{x^{\prime}_{1},x^{\prime}_{2},x^{\prime}_{3},\bar{x}\}

(8)

s^{g}=\frac{1}{2}\mathbf{k}ns

(9)

where $\{r,\mu,s\}$ are the raw Gaussian attributes in local space, $\{r^{g},\mu^{g},s^{g}\}$ are the Gaussian attributes in global space, $\mathbf{R}=\{R,R,R,R\}$ , and $\mathbf{k}=\{k,k,k,k\}$ . We define $R$ as the global rotation, i.e. the orientation of the triangle in global space, and $k$ as the triangle scaling, given by the mean of the length of one edge and the length of its perpendicular. We repeat the original scale and rotation four times to match the four point positions. Each triangle has its own scaling factor $k$ and global rotation $R$ , which are fixed during training.
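For a single Gaussian attached to one of the four anchor positions, Eqs. 8 and 9 can be sketched as follows. The rotation $R$ is assumed to be given as a 3x3 matrix and $k$ as a scalar; the rotation composition of Eq. 7 is omitted, and all names are illustrative.

```python
def to_global(mu_local, s_local, R, k, n, anchor):
    """Map local position and scale to global space (Eqs. 8 and 9).

    R: 3x3 triangle orientation, k: triangle scaling factor,
    n: learned interpolation parameter, anchor: one of {x1', x2', x3', centroid}.
    """
    rotated = [sum(R[i][j] * mu_local[j] for j in range(3)) for i in range(3)]
    mu_global = [0.5 * k * r + a for r, a in zip(rotated, anchor)]  # Eq. 8
    s_global = [0.5 * k * n * s for s in s_local]                   # Eq. 9
    return mu_global, s_global


# Triangle rotated by 90 degrees about the z axis, with scaling factor k = 2.
R_z90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
mu_g, s_g = to_global((1.0, 0.0, 0.0), (1.0, 1.0, 1.0), R_z90,
                      k=2.0, n=0.5, anchor=(1.0, 1.0, 0.0))
```

Because $k$ multiplies both the position offset and the scale, Gaussians on larger triangles move further from their anchor and cover a proportionally larger area.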

Besides, we adopt the adaptive density control and growth. For each 3D Gaussian that exhibits a large view-space positional gradient, we either split it into two smaller Gaussian points if it is large, or clone it if it is small.
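The split-or-clone rule described above can be sketched as a small decision function. The two threshold values are placeholders chosen for illustration only; they are not reported in this paper.

```python
def densify_action(grad_norm, max_scale,
                   grad_threshold=0.0002, scale_threshold=0.01):
    """Decide what adaptive density control does with one Gaussian.

    grad_norm: magnitude of the view-space positional gradient.
    max_scale: largest scale component of the Gaussian.
    Both thresholds are hypothetical defaults.
    """
    if grad_norm < grad_threshold:
        return "keep"   # gradient is small: leave the Gaussian unchanged
    if max_scale > scale_threshold:
        return "split"  # large Gaussian: replace it by two smaller ones
    return "clone"      # small Gaussian: duplicate it


actions = [
    densify_action(grad_norm=0.0001, max_scale=0.5),
    densify_action(grad_norm=0.001, max_scale=0.5),
    densify_action(grad_norm=0.001, max_scale=0.001),
]
```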

Because of the learnable sampling strategy, we introduce a scale regularization on the scale property to prevent the scale of a Gaussian point within a triangle face from becoming too large; we describe it in Sec. 4.5.

4.3 Transform Network

At this point, we have the Gaussian attributes in global space, $r^{g}$ , $\mu^{g}$ , $s^{g}$ , which are attached to the mesh. However, there are still non-surface regions and subtle facial details that cannot be modeled by the FLAME framework. To fit the Gaussian deformation caused by expression motions, we employ a standard MLP network $F_{\theta}$ as the Gaussian deformation transform. We define the Point-based Learnable Representation Field at timestep 0 as the initial canonical space. The MLP takes as input the positions in the initial canonical space $\mu^{g}_{0}$ and the current FLAME properties $\rho_{i}$ (expression and pose, excluding shape), and outputs spatial residuals of the Gaussian attributes:

\displaystyle\delta\mu_{i},\delta s_{i},\delta r_{i}=F_{\theta}(\gamma(\mu^{g}_{0}),\rho_{i})

(10)

where $\theta$ represents the optimized parameters of the MLP and $\gamma$ represents the positional encoding that maps the spatial coordinates of each 3D Gaussian into a high-dimensional sequence, as described in NeRF [29]. We use 10 frequencies for the positional encoding. Finally, the current spatial parameters that are passed to the final rendering step can be represented as:

\displaystyle\mu^{g}_{i},s^{g}_{i},r^{g}_{i}=\mu^{g}_{0}\oplus\delta\mu_{i},s^{g}_{0}\oplus\delta s_{i},r^{g}_{0}\oplus\delta r_{i}

(11)
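The two non-learned parts of Eqs. 10 and 11 can be sketched as follows. The encoding follows the sinusoidal form of NeRF with 10 frequencies; treating $\oplus$ as element-wise addition is an assumption that is natural for positions but may differ for rotations. The MLP $F_{\theta}$ itself is not modeled, and the residual value below is a made-up stand-in for its output.

```python
import math


def positional_encoding(p, num_freqs=10):
    """NeRF-style encoding: sin and cos of 2^l * pi * p for l = 0..num_freqs-1."""
    encoded = []
    for coord in p:
        for level in range(num_freqs):
            angle = (2.0 ** level) * math.pi * coord
            encoded.extend([math.sin(angle), math.cos(angle)])
    return encoded


def apply_residual(base, delta):
    """Eq. 11 under the assumption that the combination is element-wise addition."""
    return [b + d for b, d in zip(base, delta)]


mu_canonical = [0.0, 0.5, 1.0]
features = positional_encoding(mu_canonical)       # 3 coords * 10 freqs * 2 = 60 values
delta_mu = [0.25, -0.25, 0.0]                       # hypothetical network output
mu_current = apply_residual(mu_canonical, delta_mu)
```

The 60-dimensional feature vector would be concatenated with the FLAME properties $\rho_{i}$ before entering the network.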

4.4 Geometry Enhancement

Hair strands and other non-facial structures such as eyeglasses and hairstyles have a profound impact on photo-realistic rendering. In this work, we incorporate a pseudo ground-truth alpha map $A$ for each ground-truth view. We introduce the alpha rendering as follows:

\hat{A}(x)=\sum_{i\in M}\alpha_{i}(x)\prod_{j=1}^{i-1}(1-\alpha_{j}(x))

(12)

We then encourage the consistency between the rendered alpha map $\hat{A}$ and the pseudo alpha map $A$ , which is quantified as follows:

\mathcal{L}_{\alpha}=\lambda_{\alpha}\|\hat{A}-A\|^{2}

(13)

where $\lambda_{\alpha}$ is a hyperparameter equal to 0.5 and the pseudo $A$ is obtained from SGHM [7].
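Eqs. 12 and 13 can be sketched on flattened pixel lists as below. Averaging the squared error over pixels is an assumption, since Eq. 13 only specifies a squared norm; the function names are illustrative.

```python
def accumulate_alpha(alphas):
    """Eq. 12: accumulated opacity of one pixel, Gaussians sorted near to far."""
    accumulated, transmittance = 0.0, 1.0
    for alpha in alphas:
        accumulated += alpha * transmittance
        transmittance *= 1.0 - alpha
    return accumulated


def alpha_loss(rendered, target, lam=0.5):
    """Eq. 13 with lambda_alpha = 0.5; mean over pixels is an assumption."""
    squared = [(r - t) ** 2 for r, t in zip(rendered, target)]
    return lam * sum(squared) / len(squared)


# Two pixels: the first is covered by two half-transparent Gaussians,
# the second by a single opaque one. Both should be foreground (alpha = 1).
rendered = [accumulate_alpha([0.5, 0.5]), accumulate_alpha([1.0])]
loss = alpha_loss(rendered, target=[1.0, 1.0])
```

The first pixel accumulates only 0.75 opacity, so the loss pushes the Gaussians covering it towards higher opacity, which is how the alpha supervision closes holes at the avatar boundary.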

4.5 Optimization Scheme

Given the rendered image $\hat{C}$ via Eq. 3 and the ground truth image $C$ , we utilize the L1 loss $\mathcal{L}_{l1}$ and a D-SSIM term [56] to supervise the pixel-wise difference by:

\mathcal{L}_{color}=(1-\lambda_{ssim})\mathcal{L}_{l1}+\lambda_{ssim}\mathcal{L}_{D-SSIM},\quad\mathcal{L}_{l1}=\|\hat{C}-C\|_{1}

(14)

where $\lambda_{ssim}$ is a hyperparameter equal to 0.4.
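The weighting in Eq. 14 can be sketched as follows. The L1 term is taken as the mean absolute difference, as its name suggests, and the D-SSIM term is taken as `1 - SSIM`, which is an assumption about the convention; the SSIM value itself is passed in rather than computed.

```python
def l1_loss(pred, gt):
    """Mean absolute difference between two flattened images."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)


def color_loss(l1, ssim, lambda_ssim=0.4):
    """Eq. 14 with D-SSIM assumed to be 1 - SSIM."""
    d_ssim = 1.0 - ssim
    return (1.0 - lambda_ssim) * l1 + lambda_ssim * d_ssim


l1 = l1_loss([0.2, 0.8], [0.0, 1.0])
loss = color_loss(l1, ssim=0.9)  # 0.6 * 0.2 + 0.4 * 0.1
```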

Besides, to preserve the structure of the rendered image $\hat{C}$ relative to the ground truth image $C$ , we add a structure loss $\mathcal{L}_{st}$ between them, which maintains detail information and increases contrast.

\mathcal{L}_{st}=\lambda_{st}((\nabla_{lr}(\hat{C})-\nabla_{lr}(C))^{2}+(\nabla_{rl}(\hat{C})-\nabla_{rl}(C))^{2}),

(15)

where $\nabla_{lr}$ denotes the image gradients calculated from left to right, $\nabla_{rl}$ denotes the image gradients calculated from right to left, and $\lambda_{st}$ is a hyperparameter equal to 0.3. We use a 2D convolution to calculate the gradients. Because we move from the multi-view setting to a single view, we introduce several regularization terms to ensure consistency between the monocular video and the FLAME mesh.
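A minimal sketch of the structure loss in Eq. 15 on a single-channel image is given below. The gradients are computed with simple forward and backward differences, which stand in for the 2D convolution kernels; the actual kernels are not specified, so this choice and the averaging over pixels are assumptions.

```python
def grad_lr(img):
    """Left-to-right differences: I[y][x + 1] - I[y][x]."""
    return [[row[x + 1] - row[x] for x in range(len(row) - 1)] for row in img]


def grad_rl(img):
    """Right-to-left differences: I[y][x - 1] - I[y][x]."""
    return [[row[x - 1] - row[x] for x in range(1, len(row))] for row in img]


def structure_loss(pred, gt, lam=0.3):
    """Eq. 15 with lambda_st = 0.3, averaged over all gradient entries."""
    total, count = 0.0, 0
    for grad in (grad_lr, grad_rl):
        for row_pred, row_gt in zip(grad(pred), grad(gt)):
            for a, b in zip(row_pred, row_gt):
                total += (a - b) ** 2
                count += 1
    return lam * total / count


flat = [[0.0, 0.0, 0.0]]
edge = [[0.0, 1.0, 0.0]]
loss_same = structure_loss(edge, edge)
loss_diff = structure_loss(edge, flat)
```

The loss is zero when the rendered image reproduces every edge of the ground truth and grows when edges are blurred away, which is the behaviour the structure term is meant to encourage.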

Scale regularization on invisible points. Under the multi-view setting, most Gaussian points around the original FLAME mesh are visible. However, a monocular view captures only a limited number of points compared to multi-view setups, so Gaussian points located at positions not visible from the monocular camera may degrade images rendered from novel viewpoints.

Thus, we introduce the invisible point scale regularization by:

\mathcal{L}_{invis}=\|\mathcal{M}*s\|

(16)

\mathcal{M}=\begin{cases}1,\mbox{if radii}<0\\ 0,\mbox{otherwise}\end{cases}

(17)

where $\mathcal{M}$ represents the invisibility mask.

Scale threshold regularization. As for the visible Gaussian points, if the scale of a Gaussian point within a triangle face is too large, it will result in unreasonable jittering artifacts. In order to mitigate this, we add the threshold regularization to the visible Gaussian points:

\mathcal{L}_{scale}=\left\|\max((1-\mathcal{M})s,\xi_{scaling})\right\|

(18)

with $\xi_{scaling}=0.15$ .
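The two scale regularizers in Eqs. 16 to 18 can be sketched as below. For brevity each Gaussian carries a single scalar scale and the norm is taken as the L2 norm; both simplifications are assumptions. The mask condition follows Eq. 17 as written.

```python
import math


def invisibility_mask(radii):
    """Eq. 17 as written: 1 where the screen-space radius is negative, else 0."""
    return [1.0 if r < 0 else 0.0 for r in radii]


def invis_loss(mask, scales):
    """Eq. 16: norm of the scales of invisible Gaussians."""
    return math.sqrt(sum((m * s) ** 2 for m, s in zip(mask, scales)))


def scale_loss(mask, scales, xi=0.15):
    """Eq. 18: norm of the visible scales after an element-wise max with xi."""
    return math.sqrt(sum(max((1.0 - m) * s, xi) ** 2 for m, s in zip(mask, scales)))


radii = [-1, 5, 3]
scales = [0.3, 0.4, 0.1]
mask = invisibility_mask(radii)
l_invis = invis_loss(mask, scales)
l_scale = scale_loss(mask, scales)
```

Under the element-wise max, scales below the threshold contribute only a constant to the loss, so only visible Gaussians whose scale exceeds `xi` (here the second one) receive a gradient from this term.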

Our final loss function is thus:

\mathcal{L}=\mathcal{L}_{color}+\mathcal{L}_{\alpha}+\mathcal{L}_{st}+\lambda_{invis}\mathcal{L}_{invis}+\lambda_{scale}\mathcal{L}_{scale}

(19)

with $\lambda$ denoting the weights of each regularization term, which are set as follows: $\lambda_{invis}=0.3,\lambda_{scale}=0.15$ .

5 Experiments

5.1 Experiment Setup

Dataset. In the experiments, we use 6 sets of data, mainly released by previous works [49, 12]. Each set mainly consists of a complete video at 25 FPS and 512×512 resolution. Each processed video is between 2 and 3 minutes long, roughly 3000-4000 frames. We use RemBg for foreground segmentation and SGHM [7] for alpha maps. All subjects are from open-source data.

Besides, to verify the robustness of our approach, we capture our own dataset of 3 subjects using an iPhone 15 Pro Max. We use a tripod to prevent the phone from shaking. The person sitting in front of the iPhone camera is prompted to rotate their head and perform various facial expressions; the capture setup is shown in Fig. 4. All captured videos are cropped and down-sampled to 600×600 resolution, and each video contains approximately 1500-2000 frames. We use the preprocessing code to extract the FLAME parameters. Further details of the data preprocessing pipeline are available in the supplementary material.

Implementation Details. We implement our network with PyTorch [32] and use Adam [22] for parameter optimization. The learning rates of the Gaussian parameters are the same as in the original implementation, while the learning rate of the transform network is $1\times 10^{-4}$ . We train our model for 120000 iterations. Every 3000 iterations, we reset the opacity property of the Gaussian points. We perform the adaptive density control strategy every 400 iterations until iteration 60000. We use a single Nvidia GeForce RTX 4090 GPU for all of our experiments.
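The iteration schedule described above can be sketched as a small helper that reports which maintenance steps fall on a given iteration. Whether iteration 60000 itself still triggers density control and whether opacity resets continue after that point are assumptions; the function and event names are illustrative.

```python
def schedule_events(iteration, opacity_reset_every=3000,
                    densify_every=400, densify_until=60000):
    """Return the maintenance steps that run at this training iteration."""
    events = []
    if iteration > 0 and iteration % opacity_reset_every == 0:
        events.append("reset_opacity")
    if 0 < iteration <= densify_until and iteration % densify_every == 0:
        events.append("densify")
    return events


TOTAL_ITERATIONS = 120000
# Count how often each step fires over a full training run.
densify_count = sum("densify" in schedule_events(i)
                    for i in range(1, TOTAL_ITERATIONS + 1))
reset_count = sum("reset_opacity" in schedule_events(i)
                  for i in range(1, TOTAL_ITERATIONS + 1))
```

Under these assumptions a full run performs 150 density-control steps and 40 opacity resets.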

Baseline. We compare our method with three representative works: INSTA [60], an efficient implicit head representation that creates a surface-embedded dynamic neural radiance field based on neural graphics primitives; GaussianAvatars [34], a novel approach that binds Gaussian points to the mesh to drive the Gaussian rendering; and FlashAvatar [49], which uses UV sampling for Gaussian initialization and an offset network to compensate for the FLAME offset. For a fair comparison, both GaussianAvatars and FlashAvatar employ the FLAME2020 model, and their training settings are similar to ours. All experiments are conducted using a single Nvidia GeForce RTX 4090 GPU.

Table 1: Quantitative comparisons with state-of-the-art head avatar reconstruction methods on the open-source datasets.

Method	"ID1"			"ID2"			"ID3"			"ID4"			"ID5"			"ID6"
	PSNR↑	SSIM↑	LPIPS↓	PSNR↑	SSIM↑	LPIPS↓	PSNR↑	SSIM↑	LPIPS↓	PSNR↑	SSIM↑	LPIPS↓	PSNR↑	SSIM↑	LPIPS↓	PSNR↑	SSIM↑	LPIPS↓
INSTA [60]	17.28	0.852	0.207	15.95	0.831	0.221	14.63	0.846	0.177	19.84	0.921	0.112	13.83	0.806	0.251	12.56	0.796	0.272
GaussianAvatars [34]	25.12	0.906	0.156	19.75	0.903	0.150	17.77	0.907	0.138	23.56	0.952	0.050	19.02	0.857	0.222	17.63	0.782	0.272
FlashAvatar [49]	28.69	0.911	0.116	25.46	0.902	0.105	23.70	0.938	0.096	25.05	0.951	0.075	26.52	0.916	0.097	24.25	0.821	0.158
Ours	28.76	0.934	0.054	25.81	0.935	0.062	22.11	0.939	0.075	26.92	0.963	0.029	25.14	0.918	0.074	23.51	0.866	0.128

Table 2: Quantitative comparisons with state-of-the-art head avatar reconstruction methods on our captured datasets.

Method	"subject1"			"subject2"			"subject3"
	PSNR↑	SSIM↑	LPIPS↓	PSNR↑	SSIM↑	LPIPS↓	PSNR↑	SSIM↑	LPIPS↓
INSTA [60]	17.38	0.812	0.242	18.07	0.835	0.236	15.31	0.771	0.264
GaussianAvatars [34]	23.02	0.859	0.144	21.63	0.893	0.154	26.49	0.904	0.088
FlashAvatar [49]	26.68	0.860	0.079	27.23	0.907	0.076	28.48	0.900	0.074
Ours	27.83	0.900	0.066	28.31	0.929	0.064	30.61	0.941	0.041

5.2 Qualitative and Quantitative Comparison in Reconstruction

Fig. 5 visualizes the alpha rendering result of FAGhead. Fig. 6 and Fig. 7 depict the qualitative comparison between our model and the above methods.

INSTA [60] generates renders that are reasonably consistent in articulated facial expression and head pose. However, INSTA uses neural graphics primitives embedded in the FLAME surface, which limits its ability to accurately model accessories such as ties and necklaces (the 3rd and 5th rows). Besides, it tends to generate artifacts around the mouth (the 2nd and 5th rows in Fig. 6 and the 1st row in Fig. 7) and over-smoothed results that ignore thin structures, especially in the shoulder region.

GaussianAvatars [34] primarily relies on a parametric morphable face model to rig 3D Gaussian splats. It achieves a more precise geometric representation by mapping the 3D Gaussian splats from local to global coordinates within a FLAME mesh, which means that the rendering performance hinges on the quality of the parametric morphable face model. Thus, it tends to generate artifacts around the eyes because of geometry errors (the 1st row in Fig. 6). Besides, because it relies on the parametric morphable face model, it usually produces over-smoothed accessories (the 3rd and 5th rows in Fig. 6 and the 1st row in Fig. 7).

FlashAvatar [49] initializes a uniform 3D Gaussian field embedded in the surface. However, because it uses a fixed number of 3D Gaussian splats without an adaptive density control strategy, artifacts show up as spikes at the edges of the avatars, such as the shoulders or hair (the 1st rows). Moreover, blur appears around the eye and mouth regions (the 2nd and 3rd rows in Fig. 6 and the 3rd row in Fig. 7). In contrast, our method produces photorealistic images that closely align with the ground truth, capturing nearly all fine facial details, thin structures, and accessories.

See Tab. 1 and Tab. 2 for the quantitative comparison between our model and the other methods. The metrics include PSNR, SSIM, and LPIPS [56].
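For reference, PSNR follows directly from the mean squared error between a rendered image and its ground truth. The sketch below uses the standard definition for images with values in $[0,1]$; it is not the evaluation script used for the tables, and the function name is illustrative.

```python
import math


def psnr(pred, gt, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two flattened images."""
    mse = sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 0.1 on every pixel gives an MSE of 0.01, i.e. about 20 dB.
value = psnr([0.1, 0.1, 0.1, 0.1], [0.0, 0.0, 0.0, 0.0])
```

Since the scale is logarithmic, a gain of 1 dB corresponds to roughly a 21% reduction in mean squared error.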

5.3 Novel View Synthesis

To demonstrate the novel view rendering ability of our FAGhead, we freely rotate the camera pose to generate new results from novel rendering views as illustrated in Fig. 8. We first utilize a specified sample from the test dataset with FLAME parameter and camera pose. The multiple viewpoints can be produced via multiplying the camera pose with a novel rotation angle. Then the rendered images with different rendering views are generated by the Gaussian rendering. The novel view results show no artifacts or unrealistic facial expressions, even at intermediate rotation angles.
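The camera rotation used for novel views can be sketched as below, assuming the camera orientation is a 3x3 rotation matrix and the new rotation is applied about the vertical axis by left multiplication. The axis, the multiplication order, and the angle values are all illustrative assumptions.

```python
import math


def rotation_y(degrees):
    """Rotation matrix about the y axis."""
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rotate_pose(R_cam, degrees):
    """Compose a yaw rotation with an existing camera rotation."""
    R = rotation_y(degrees)
    return [[sum(R[i][k] * R_cam[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Three viewpoints around a test frame: 30 degrees left, frontal, 30 degrees right.
views = [rotate_pose(identity, angle) for angle in (-30.0, 0.0, 30.0)]
quarter_turn = rotate_pose(identity, 90.0)
round_trip = rotate_pose(rotate_pose(identity, 30.0), -30.0)
```

Each rotated pose is then passed to the Gaussian renderer together with the unchanged FLAME parameters of the test frame.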

5.4 Cross-identity Reenactment

The cross-identity reenactment results are presented in Fig. 9. By replacing the input FLAME parameters of the transform network, our approach performs expression reenactment with finer-grained teeth and gaze direction.

Compared with our results, GaussianAvatars [34] shows artifacts at the shoulder edges and unnatural gaze directions. Because it depends on the offset network, the results generated by FlashAvatar [49] appear somewhat over-smoothed. INSTA [60], despite its deformable neural radiance field, tends to lose geometric details. More reenactment results are available in the supplementary materials.

Table 3: We systematically ablated several key components and assessed quantitative performance to demonstrate their effectiveness.

Metrics	PSNR↑	SSIM↑	LPIPS↓
w/o PLRF	25.61	0.884	0.239
w/o Transform Network	25.21	0.902	0.104
w/o Alpha rendering	28.11	0.927	0.065
w/o Structure loss $\mathcal{L}_{st}$	28.47	0.931	0.058
Ours	28.75	0.934	0.054

5.5 Ablation Study

To validate the effectiveness of our method components, we deactivate each of them and report results in Tab. 3.

Point-based Learnable Representation Field. Without the Point-based Learnable Representation Field, we randomly initialize 100000 Gaussian points to form the canonical space. We directly optimize the Gaussian point attributes and obtain the final results by applying the spatial residuals output by the transform network. As shown in the 1st row of Tab. 3 and in Fig. 10, the expression details are not well captured, demonstrating the influence of the Point-based Learnable Representation Field in preserving fine details.

Transform Network. Without the transform network, we use only the raw FLAME parameters to deform the mesh, relying on the linear transformations defined by the LBS formula of the original FLAME model [25]. The limited geometry information causes blurry renderings and a large loss in fidelity, as shown in the 2nd row of Tab. 3 and in Fig. 11.

Alpha rendering. Without alpha rendering, artifacts show up as spikes at the edges of the avatars, such as the shoulders or hair. Besides, a hole appears between the neck and the collar, as shown in the 3rd row of Tab. 3 and in Fig. 12, which is physically implausible. Alpha rendering enforces geometric consistency; without it, the Gaussian rendering process relies solely on color information and disregards geometric attributes.

Structure loss $\mathcal{L}_{st}$ . The structure loss $\mathcal{L}_{st}$ maintains detail information and increases contrast. We show the results without the structure loss $\mathcal{L}_{st}$ in the 4th row of Tab. 3.

6 Conclusion and Discussion

FAGhead is a novel approach that achieves high-fidelity reconstruction and full animation of 3D human avatars. We propose the Point-based Learnable Representation Field as the prior for reconstructing portraits and exploit the transform network to fit the deformation, outperforming state-of-the-art methods. However, our method still has room for improvement. One limitation is that it does not model the oral cavity effectively. Moreover, our rendering performance depends heavily on the quality of data preprocessing, and the method struggles to recover from significant errors at that stage. Addressing these issues will be a key focus of our future research.

References

[1] Athar, S., Xu, Z., Sunkavalli, K., Shechtman, E., Shu, Z.: Rignerf: Fully controllable neural 3d portraits. In: Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition. pp. 20364–20373 (2022)
[2] Blanz, V., Vetter, T.: A morphable model for the synthesis of 3d faces. In: Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pp. 157–164 (2023)
[3] Buehler, C., Bosse, M., McMillan, L., Gortler, S., Cohen, M.: Unstructured lumigraph rendering. In: Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pp. 497–504 (2023)
[4] Cao, C., Weng, Y., Zhou, S., Tong, Y., Zhou, K.: Facewarehouse: A 3d facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics 20(3), 413–425 (2013)
[5] Chaudhuri, S., Kalogerakis, E., Giguere, S., Funkhouser, T.: Attribit: content creation with semantic attributes. In: Proceedings of the 26th annual ACM symposium on User interface software and technology. pp. 193–202 (2013)
[6] Chen, A., Xu, Z., Geiger, A., Yu, J., Su, H.: Tensorf: Tensorial radiance fields. In: European Conference on Computer Vision. pp. 333–350. Springer (2022)
[7] Chen, X., Zhu, Y., Li, Y., Fu, B., Sun, L., Shan, Y., Liu, S.: Robust human matting via semantic guidance. In: Proceedings of the Asian Conference on Computer Vision (ACCV) (2022)
[8] Chen, Y., Wang, L., Li, Q., Xiao, H., Zhang, S., Yao, H., Liu, Y.: Monogaussianavatar: Monocular gaussian point-based head avatar. arXiv preprint arXiv:2312.04558 (2023)
[9] Daněček, R., Black, M.J., Bolkart, T.: Emoca: Emotion driven monocular face capture and animation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20311–20322 (2022)
[10] Feng, Y., Feng, H., Black, M.J., Bolkart, T.: Learning an animatable detailed 3d face model from in-the-wild images. ACM Transactions on Graphics (ToG) 40(4), 1–13 (2021)
[11] Fridovich-Keil, S., Yu, A., Tancik, M., Chen, Q., Recht, B., Kanazawa, A.: Plenoxels: Radiance fields without neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5501–5510 (2022)
[12] Gao, X., Zhong, C., Xiang, J., Hong, Y., Guo, Y., Zhang, J.: Reconstructing personalized semantic facial nerf models from monocular video. ACM Transactions on Graphics (TOG) 41(6), 1–12 (2022)
[13] Gecer, B., Ploumpis, S., Kotsia, I., Zafeiriou, S.: Fast-ganfit: Generative adversarial network for high fidelity 3d face reconstruction. arXiv preprint arXiv:2105.07474 (2021)
[14] Grassal, P.W., Prinzler, M., Leistner, T., Rother, C., Nießner, M., Thies, J.: Neural head avatars from monocular rgb videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18653–18664 (2022)
[15] Han, X.F., Laga, H., Bennamoun, M.: Image-based 3d object reconstruction: State-of-the-art and trends in the deep learning era. IEEE transactions on pattern analysis and machine intelligence 43(5), 1578–1604 (2019)
[16] Hu, Y.L., Yin, B.C., Cheng, S.Q., Gu, C.L.: An improved morphable model for 3d face synthesis. In: Proceedings of 2004 International Conference on Machine Learning and Cybernetics (IEEE Cat. No. 04EX826). vol. 7, pp. 4362–4367. IEEE (2004)
[17] Jin, H., Liu, I., Xu, P., Zhang, X., Han, S., Bi, S., Zhou, X., Xu, Z., Su, H.: Tensoir: Tensorial inverse rendering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 165–174 (2023)
[18] Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196 (2017)
[19] Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4401–4410 (2019)
[20] Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics 42(4), 1–14 (2023)
[21] Kim, H., Garrido, P., Tewari, A., Xu, W., Thies, J., Niessner, M., Pérez, P., Richardt, C., Zollhöfer, M., Theobalt, C.: Deep video portraits. ACM transactions on graphics (TOG) 37(4), 1–14 (2018)
[22] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
[23] Kirschstein, T., Qian, S., Giebenhain, S., Walter, T., Nießner, M.: Nersemble: Multi-view radiance field reconstruction of human heads. ACM Transactions on Graphics (TOG) 42(4), 1–14 (2023)
[24] Li, J., Zhang, J., Bai, X., Zhou, J., Gu, L.: Efficient region-aware neural radiance fields for high-fidelity talking portrait synthesis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7568–7578 (2023)
[25] Li, T., Bolkart, T., Black, M.J., Li, H., Romero, J.: Learning a model of facial shape and expression from 4d scans. ACM Trans. Graph. 36(6), 194–1 (2017)
[26] Liu, L., Gu, J., Zaw Lin, K., Chua, T.S., Theobalt, C.: Neural sparse voxel fields. Advances in Neural Information Processing Systems 33, 15651–15663 (2020)
[27] Lombardi, S., Simon, T., Schwartz, G., Zollhoefer, M., Sheikh, Y., Saragih, J.: Mixture of volumetric primitives for efficient neural rendering. ACM Transactions on Graphics (ToG) 40(4), 1–13 (2021)
[28] Luo, H., Nagano, K., Kung, H.W., Xu, Q., Wang, Z., Wei, L., Hu, L., Li, H.: Normalized avatar synthesis using stylegan and perceptual refinement. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11662–11672 (2021)
[29] Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM 65(1), 99–106 (2021)
[30] Park, K., Sinha, U., Barron, J.T., Bouaziz, S., Goldman, D.B., Seitz, S.M., Martin-Brualla, R.: Nerfies: Deformable neural radiance fields. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5865–5874 (2021)
[31] Park, K., Sinha, U., Hedman, P., Barron, J.T., Bouaziz, S., Goldman, D.B., Martin-Brualla, R., Seitz, S.M.: Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields. arXiv preprint arXiv:2106.13228 (2021)
[32] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019)
[33] Paysan, P., Knothe, R., Amberg, B., Romdhani, S., Vetter, T.: A 3d face model for pose and illumination invariant face recognition. In: 2009 sixth IEEE international conference on advanced video and signal based surveillance. pp. 296–301. IEEE (2009)
[34] Qian, S., Kirschstein, T., Schoneveld, L., Davoli, D., Giebenhain, S., Nießner, M.: Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians. arXiv preprint arXiv:2312.02069 (2023)
[35] Ravi, N., Reizenstein, J., Novotny, D., Gordon, T., Lo, W.Y., Johnson, J., Gkioxari, G.: Accelerating 3d deep learning with pytorch3d. arXiv preprint arXiv:2007.08501 (2020)
[36] Rivero, A., Athar, S., Shu, Z., Samaras, D.: Rig3dgs: Creating controllable portraits from casual monocular videos. arXiv preprint arXiv:2402.03723 (2024)
[37] Romdhani, S., Pierrard, J.S., Vetter, T.: 3d morphable face model, a unified approach for analysis and synthesis of images. Face Processing: Advanced Modeling and Methods p. 768 (2005)
[38] Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4104–4113 (2016)
[39] Shum, H., Kang, S.B.: Review of image-based rendering techniques. In: Visual Communications and Image Processing 2000. vol. 4067, pp. 2–13. SPIE (2000)
[40] Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: exploring photo collections in 3d. In: ACM siggraph 2006 papers, pp. 835–846 (2006)
[41] Sun, C., Sun, M., Chen, H.T.: Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5459–5469 (2022)
[42] Tewari, A., Thies, J., Mildenhall, B., Srinivasan, P., Tretschk, E., Yifan, W., Lassner, C., Sitzmann, V., Martin-Brualla, R., Lombardi, S., et al.: Advances in neural rendering. In: Computer Graphics Forum. vol. 41, pp. 703–735. Wiley Online Library (2022)
[43] Thies, J., Elgharib, M., Tewari, A., Theobalt, C., Nießner, M.: Neural voice puppetry: Audio-driven facial reenactment. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16. pp. 716–731. Springer (2020)
[44] Tomasi, C., Kanade, T.: Shape and motion from image streams under orthography: a factorization method. International journal of computer vision 9, 137–154 (1992)
[45] Tran, L., Liu, F., Liu, X.: Towards high-fidelity nonlinear 3d face morphable model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1126–1135 (2019)
[46] Tran, L., Liu, X.: Nonlinear 3d face morphable model. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7346–7355 (2018)
[47] Wang, J., Xie, J.C., Li, X., Xu, F., Pun, C.M., Gao, H.: Gaussianhead: Impressive head avatars with learnable gaussian diffusion. arXiv preprint arXiv:2312.01632 (2023)
[48] Wang, M., Lyu, X.Q., Li, Y.J., Zhang, F.L.: Vr content creation and exploration with deep learning: A survey. Computational Visual Media 6, 3–28 (2020)
[49] Xiang, J., Gao, X., Guo, Y., Zhang, J.: Flashavatar: High-fidelity digital avatar rendering at 300fps. arXiv preprint arXiv:2312.02214 (2023)
[50] Xu, Q., Xu, Z., Philip, J., Bi, S., Shu, Z., Sunkavalli, K., Neumann, U.: Point-nerf: Point-based neural radiance fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5438–5448 (2022)
[51] Xu, Y., Chen, B., Li, Z., Zhang, H., Wang, L., Zheng, Z., Liu, Y.: Gaussian head avatar: Ultra high-fidelity head avatar via dynamic gaussians. arXiv preprint arXiv:2312.03029 (2023)
[52] Xu, Y., Wang, L., Zhao, X., Zhang, H., Liu, Y.: Avatarmav: Fast 3d head avatar reconstruction using motion-aware neural voxels. In: ACM SIGGRAPH 2023 Conference Proceedings. pp. 1–10 (2023)
[53] Yu, A., Li, R., Tancik, M., Li, H., Ng, R., Kanazawa, A.: Plenoctrees for real-time rendering of neural radiance fields. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5752–5761 (2021)
[54] Zackariasson, P., Wilson, T.L.: The video game industry: Formation, present state, and future. Routledge (2012)
[55] Zhang, C., Chen, T.: A survey on image-based rendering—representation, sampling and compression. Signal Processing: Image Communication 19(1), 1–28 (2004)
[56] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 586–595 (2018)
[57] Zheng, M., Yang, H., Huang, D., Chen, L.: Imface: A nonlinear 3d morphable face model with implicit neural representations. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20343–20352 (2022)
[58] Zheng, Y., Yifan, W., Wetzstein, G., Black, M.J., Hilliges, O.: Pointavatar: Deformable point-based head avatars from videos. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 21057–21067 (2023)
[59] Zielonka, W., Bolkart, T., Thies, J.: Towards metrical reconstruction of human faces. In: European Conference on Computer Vision. pp. 250–269. Springer (2022)
[60] Zielonka, W., Bolkart, T., Thies, J.: Instant volumetric head avatars. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4574–4584 (2023)
[61] Zwicker, M., Pfister, H., Van Baar, J., Gross, M.: Ewa volume splatting. In: Proceedings Visualization, 2001. VIS’01. pp. 29–538. IEEE (2001)

Supplementary Material

7 Implementation Details

7.1 Network Architecture

Fig. 13 shows the structure of the transform network, which mainly consists of a set of fully connected (FC) layers with Tanh activation functions.
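As a rough illustration of this kind of architecture, the sketch below stacks fully connected layers with Tanh activations and maps a per-point input feature to a 3D spatial residual. The layer widths, the input dimension, the linear output layer, and the class name TanhMLP are placeholders chosen for the example; the actual configuration is the one shown in Fig. 13.

```python
import numpy as np

class TanhMLP:
    """Stack of fully connected layers with Tanh between them (forward pass only)."""

    def __init__(self, layer_sizes, seed=0):
        rng = np.random.default_rng(seed)
        pairs = list(zip(layer_sizes[:-1], layer_sizes[1:]))
        # Weights scaled by 1/sqrt(fan_in); biases start at zero.
        self.weights = [rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n)) for m, n in pairs]
        self.biases = [np.zeros(n) for _, n in pairs]

    def __call__(self, x):
        h = np.asarray(x, dtype=np.float64)
        last = len(self.weights) - 1
        for i, (w, b) in enumerate(zip(self.weights, self.biases)):
            h = h @ w + b
            # Tanh on hidden layers; the output layer is left linear (assumption).
            if i < last:
                h = np.tanh(h)
        return h

# Usage: 8-dimensional per-point features (placeholder) mapped to 3D residuals.
net = TanhMLP([8, 16, 16, 3])
features = np.zeros((5, 8))
residuals = net(features)  # shape (5, 3)
```

Leaving the last layer linear lets the predicted residual take any magnitude; applying Tanh there as well would instead bound each offset component to (-1, 1).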

7.2 Data Preprocessing Pipeline

The data preprocessing pipeline is illustrated in Fig. 14. After obtaining the original captured videos, we first crop the head region. The background is then removed using a segmentation algorithm, and the shape vectors of the mesh are extracted with MICA. Finally, the FLAME parameters are extracted using our modified tracker, which is based on the Metrical Photometric Tracker.

We also conduct an ablation study, shown in Fig. 16, in which we use the original tracker to extract the FLAME parameters. This leads to misalignment between the initialized mesh and the ground-truth image, resulting in disorganized results.

8 Additional Results

8.1 Additional Reenactment Results

In Fig. 16, we present additional reenactment results on our captured datasets. Our approach demonstrates superior performance compared to previous works.