
¹ Zhejiang University    ² vivo AI Lab

FAGhead: Fully Animate Gaussian Head from Monocular Videos

Yixin Xuan¹    Xinyang Li¹    Gongxin Yao¹    Shiwei Zhou²    Donghui Sun²    Xiaoxin Chen²    Yu Pan (Corresponding author. Email: [email protected])
Abstract

High-fidelity reconstruction of 3D human avatars has a wide range of applications in virtual reality. In this paper, we introduce FAGhead, a method that produces fully controllable human portraits from monocular videos. We explicitly build on the traditional 3D morphable meshes (3DMM) and optimize neutral 3D Gaussians to reconstruct complex expressions. Furthermore, we employ a novel Point-based Learnable Representation Field (PLRF) with learnable Gaussian point positions to enhance reconstruction performance. Meanwhile, to handle the edges of avatars effectively, we introduce alpha rendering to supervise the alpha value of each pixel. Extensive experiments on open-source datasets and our own captured datasets demonstrate that our approach generates high-fidelity 3D head avatars with full control over the expression and pose of the virtual avatars, outperforming existing works.

Keywords:
3D Face Reconstruction · Facial Animation · Facial Expression Synthesis
Figure 1: Given the monocular video, our proposed FAGhead approach is able to generate high-fidelity avatars and the corresponding alpha map. By leveraging the novel Point-based Learnable Representation Field, FAGhead ensures photorealistic reanimation and extends generalization to novel expressions and head poses.

1 Introduction

3D head avatar reconstruction from monocular video has attracted rapidly growing interest in recent decades, driven by a host of applications such as 3D content creation [5], virtual reality (VR) technology [48] and gaming [54], and it remains a challenging problem in computer vision. With the development of digital humans, the demand for automated photo-realistic avatar synthesis has become increasingly prevalent.

Previous works mainly exploit the 3D morphable model (3DMM) [2, 33] representation, focusing on shape and expression transformations [25, 59, 10] to match the original avatars. However, in mono-view settings, these methods fail to meet photorealistic requirements and require accurate geometry meshes as priors, which limits their applications.

Advances in geometry reconstruction have significantly improved the accuracy of geometry synthesis. Neural Radiance Field (NeRF) [29] is the most representative work in this field, showing great capability on complex objects and leading to higher-quality results. Some methods [30, 31] produce photo-realistic human avatars by optimizing an additional continuous volumetric deformation field, while other methods [1, 12, 24] combine NeRF with the traditional 3DMM approach and can generalize to novel deformations [14]. However, the volume rendering approach, which relies on extensive sampling and alpha compositing, constrains the speed of inference.

More recently, 3D Gaussian Splatting (3DGS) [20] represents a real-world 3D scene with a set of 3D Gaussian points, each assigned variable properties, and demonstrates both photo-realistic novel view synthesis and high efficiency. For animated avatar synthesis, mainstream approaches [51, 47, 8, 36] create a deformation field from canonical space to deformed space using a Multi-Layer Perceptron (MLP). Although these approaches have made profound advances in photo-realistic avatar synthesis, they are unable to decouple identity and expression information effectively, which leads to unreasonable results in animation tasks with novel expressions.

To overcome this issue and further improve the animation quality, we propose FAGhead, a novel method based on the 3DMM representation for high-fidelity avatar construction and animation. Inspired by previous works [23, 34], which make the FLAME [25] model fully explicit via linear blend skinning (LBS) in a multi-view camera setting, we extend this idea to the monocular setting. Regarding decoupling, we separate identity and expression information during preprocessing through a modified face tracker [59].

Regarding Gaussian initialization, we propose a novel Point-based Learnable Representation Field (PLRF) approach that positions Gaussian points along the midline of a single triangle face, thereby increasing the density of Gaussian points and enhancing facial expression details. Specifically, instead of initializing the 3D Gaussian points at the center of each triangle face of the FLAME mesh, we sample Gaussian points with learnable positions along the line segments connecting the centroid to each vertex of the triangle face within each avatar mesh.

Moreover, a transform network is built to map the dynamic facial movements from the canonical point-based field to the transformed point-based field. In practice, it takes pre-retrieved FLAME parameters as conditions to produce the facial movement deformation. Besides, to improve rendering at edges such as hair and shoulders, we introduce an alpha loss between the alpha map and the rendered alpha. With these enhancements, FAGhead achieves higher-fidelity rendering and provides fully controllable avatars over facial expressions and head poses. In summary, our contributions are as follows:

  • We propose Fully Animate Gaussian Head (FAGhead), a novel head avatar synthesis approach with an effective representation field, which provides fully controllable head avatars and achieves high-fidelity face reenactment.

  • We propose a Point-based Learnable Representation Field (PLRF), which significantly enhances the generated quality and geometric detail.

  • We introduce a transform network to fit the deformation defined by the 3DMM.

  • We redesign the model structure and propose additional loss terms, namely an alpha loss and other regularization losses, to enhance the performance of geometry reconstruction.

2 Related Work

2.1 Scene Reconstruction and Novel View Synthesis

Early novel view synthesis mainly focused on image-based rendering (IBR) [39, 55], which synthesizes novel views by interpolating between nearest-neighbour images. More recently, significant advances have been made in Structure from Motion (SfM) [38] and Multi-view Stereo (MVS) [44], both of which build explicit 3D scene representations from RGB images [3, 40, 15]. In addition, COLMAP [38] plays a significant role by providing reliable 3D feature points matched across multi-view images. With the development of neural rendering [42], more effective and realistic approaches have been proposed, especially Neural Radiance Fields (NeRF) [29]. NeRF achieves highly realistic reconstruction by using an MLP network as the 3D scene representation together with volume rendering.

To improve the efficiency of NeRF, Point-NeRF [50] uses a neural point cloud initialized via a deep network and shortens training with a novel point cloud pruning and growing mechanism. NSVF [26] uses a sparse voxel octree to represent the scene, which avoids wasted computation. Meanwhile, a variety of advanced NeRFs employing other explicit representations [11, 53, 41, 6, 17] have been proposed to overcome the slow speed. However, there is still considerable room for improvement, as the rendering speed cannot meet real-time requirements. 3D Gaussian Splatting [20] provides a faster and more efficient method for reconstructing 3D scenes, combining simplicity with high quality.

2.2 3D Parametric Head Model

Building on the premise that the human head shape space can be effectively separated into identity, expression, and appearance components, the 3D Morphable Model (3DMM) [2] was proposed to embed 3D head shapes into multiple low-dimensional principal component analysis (PCA) spaces. Subsequent research [4, 46, 37, 16] has expanded this mesh-based parametric head model, enhancing its representational capacity through the development of multi-linear [4] and non-linear models [45], as well as articulated models equipped with corrective blendshapes.

Recent cutting-edge techniques have advanced in accurately capturing the intricate deformations associated with facial expressions by incorporating additional displacement maps that respond to the input images [9, 10]. In addition, generative models powered by machine learning, such as GANs [18] and StyleGAN [19], have been integrated into current frameworks [13, 28] to refine the precision in modeling facial textures and geometries. However, despite these advances, the parametric models are still mostly limited to capturing the facial region’s geometry and appearance at a rather basic level through explicit mesh models. This limitation detracts from the photorealistic quality of reconstructions and animations based on these models [14].

2.3 3D Head Portrait Synthesis

3D head portrait synthesis can be divided into explicit and implicit representations. Explicit representations, primarily based on mesh models [4], have been evolving for many years. In recent efforts, some researchers [21, 43] have utilized 2D neural rendering techniques for creating photo-realistic portraits, although these methods often overlook non-facial areas or face challenges with temporal and spatial inconsistencies due to their weak integration with 3D geometry. Other approaches [10, 14] have focused on learning vertex offsets to capture the detailed head geometry more accurately, but they can still encounter issues with geometry and texture artifacts in complex areas like hair, eyes, and mouth, limited by the mesh model's representational capacity and the challenges of differentiable rendering. PointAvatar [58] introduced a novel, deformable point-based approach, overcoming some mesh model limitations, albeit at the cost of requiring an excessive number of points and extensive training periods. Implicit models, on the other hand, utilize neural functions to create digital head avatars, with significant research [57] dedicated to achieving high fidelity, though often at the expense of training and inference efficiency. Innovations such as volumetric primitives [27] and local feature grids [12, 52, 60] have been proposed to enhance efficiency and reduce the computational load. Moreover, FlashAvatar [49] proposed a UV sampling strategy to enhance rendering efficiency.

3 Preliminary

3D Gaussian Splatting [20] differs substantially from the widely adopted Neural Radiance Field. 3DGS uses explicit 3D Gaussian points, initialized from the feature points generated by COLMAP [38], as the fundamental entities for rendering. Consider the set of rendering entities $\{G_t\}_{t=0}^{N}$, where $t$ is the index of a 3D Gaussian point and $N$ is the number of 3D Gaussian points. Each point has the properties $\zeta_t = \{\mu_t, o_t, r_t, s_t, c_t\}$. Here $\mu_t \in \mathbb{R}^3$ is the position of the Gaussian point, $o_t \in \mathbb{R}$ describes the opacity, $r_t \in \mathbb{R}^4$ indicates the rotation, $s_t \in \mathbb{R}^3$ reflects the scale of the Gaussian point, and $c_t \in \mathbb{R}^3$ denotes the view-dependent color, calculated via spherical harmonic coefficients, where $k$ is related to the degree of the spherical harmonics. Each 3D Gaussian point is mathematically defined by a spatial mean $\mu$ and a covariance matrix $\Sigma$ as follows:

$$G_t(x) = \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right), \tag{1}$$

During rendering, all 3D Gaussian points are first projected onto the specified 2D camera plane. Following prior work [61], given a viewing transformation $W$, the covariance matrix $\Sigma'$ in camera coordinates can be derived as follows:

$$\Sigma' = J W \Sigma W^T J^T, \tag{2}$$

where $J$ denotes the Jacobian of the affine approximation of the perspective projection transformation.
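The projection in Eq. 2 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the pinhole camera model, the focal lengths `fx` and `fy`, and the helper names `perspective_jacobian` and `project_covariance` are hypothetical and do not come from the paper. A 2×3 Jacobian is used so that the result is directly the 2×2 image-plane covariance.

```python
import numpy as np

def perspective_jacobian(t, fx, fy):
    """Jacobian of a pinhole projection at camera-space point t = (tx, ty, tz).

    Assumes u = fx * tx / tz and v = fy * ty / tz (hypothetical camera model).
    """
    tx, ty, tz = t
    return np.array([
        [fx / tz, 0.0, -fx * tx / tz**2],
        [0.0, fy / tz, -fy * ty / tz**2],
    ])

def project_covariance(cov3d, W, J):
    """Eq. 2: Sigma' = J W Sigma W^T J^T."""
    return J @ W @ cov3d @ W.T @ J.T

# Usage: an isotropic Gaussian on the optical axis, identity view rotation.
J = perspective_jacobian(np.array([0.0, 0.0, 2.0]), fx=2.0, fy=2.0)
cov2d = project_covariance(np.eye(3), np.eye(3), J)
```

Because the Jacobian is evaluated at each Gaussian's own camera-space position, the same 3D covariance projects to a smaller footprint as the point moves away from the camera.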

Thus, the color $\hat{C}$ of a specified pixel can be synthesized as:

$$\hat{C}(x) = \sum_{t \in M} c_t \alpha_t(x) \prod_{j=1}^{t-1} (1 - \alpha_j(x)) \tag{3}$$

with

$$\alpha_t(x) = o_t \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right) \tag{4}$$
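As a rough illustration of Eqs. 3 and 4, the following NumPy sketch blends a handful of Gaussians at a single pixel. It assumes the Gaussians are already projected to 2D and sorted from near to far. The function names are hypothetical, and a real 3DGS rasterizer performs this per tile on the GPU rather than in a Python loop.

```python
import numpy as np

def gaussian_alpha(x, mu, cov, opacity):
    """Eq. 4: opacity-weighted Gaussian falloff at pixel position x."""
    d = x - mu
    return opacity * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

def composite_color(x, mus, covs, opacities, colors):
    """Eq. 3: front-to-back alpha blending over depth-sorted Gaussians."""
    color = np.zeros(3)
    transmittance = 1.0  # running product of (1 - alpha_j) for j < t
    for mu, cov, o, c in zip(mus, covs, opacities, colors):
        a = gaussian_alpha(x, mu, cov, o)
        color += c * a * transmittance
        transmittance *= 1.0 - a
    return color

# Usage: a half-transparent red Gaussian in front of an opaque blue one.
x = np.array([0.0, 0.0])
pixel = composite_color(
    x,
    mus=[np.zeros(2), np.zeros(2)],
    covs=[np.eye(2), np.eye(2)],
    opacities=[0.5, 1.0],
    colors=[np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])],
)
```

The running `transmittance` makes the ordering dependence explicit: swapping the two Gaussians in the usage example would hide the red one entirely behind the opaque blue one.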

To optimize shared memory usage, 3DGS employs a GPU-optimized rasterization process that assigns each thread block to an image tile. This approach not only allows for realistic scene reconstruction but also delivers considerably faster rendering and lower memory consumption during training than NeRF-based approaches.

Figure 2: FAGhead overview. Before training, a Point-based Learnable Representation Field is established at timestep 0 as the canonical point-based field. The transform network takes the canonical global positions and the current FLAME parameters as input and outputs the deformation between the canonical and transformed spaces. In addition, we introduce alpha rendering to eliminate geometry errors.
Figure 3: The pipeline of Point-based Learnable Representation Field initialization and growth. We allocate four Gaussian points of each triangle as initialization. During training, the positions of Gaussian points will be dynamically adjusted. Meanwhile, we adopt the adaptive density control and growth strategy, which adds and removes splats based on the viewspace positional gradient and the opacity of each Gaussian point.

4 Method

The proposed model is outlined in Fig. 2. Firstly, we preprocess the given monocular video into FLAME parameters as detailed in Sec. 4.1. During the initialization stage (Sec. 4.2), we employ the Point-based Learnable Representation Field aligned with the canonical mesh, where the position dynamically adjusts throughout training. Subsequently, we fit the deformation between the canonical space and the current frame utilizing the transform network (Sec. 4.3). To further enhance the geometry reconstruction performance, we introduce alpha rendering (Sec. 4.4).

4.1 Data Preprocessing

Given a monocular video $V$ consisting of images $I = \{I_i\}$, our objective is to extract the camera parameters, including the intrinsic parameters $K_i$ and the camera poses $C_i$, together with the FLAME meshes $M_i$ and the corresponding property set $\rho_i$, such as shape, expression, and the poses of the jaw, neck, and eyes. Given that a monocular video provides only a single camera view, it inherently contains less information than the datasets used in Nersemble [23]. This limitation necessitates more rigorous data preprocessing. In our experiments, failing to align the initialized parametric mesh with the ground truth image leads to disorganized results.

To adapt effectively to the monocular video setting, we fix the neck pose during processing and solely optimize the camera pose relative to the head, diverging from the approach in GaussianAvatars [34]. Moreover, we convert the screen coordinates to Normalized Device Coordinates (NDC) using PyTorch3D [35], ensuring greater compatibility with 3DGS. Our optimization focuses on the field of view (FOV) while keeping the properties $Z_{near}$ and $Z_{far}$ fixed. During preprocessing, we first optimize the identity information within the FLAME2020 framework and subsequently optimize expressions and other properties while keeping the identity information fixed, effectively decoupling identity from expression information. Further details are available in the supplementary materials.

4.2 Point-based Learnable Representation Field

Given the shape, pose and expression components, a morphologically realistic mesh can be produced via the FLAME framework. The key is how to connect the FLAME mesh with 3D Gaussian Splatting effectively. Motivated by GaussianAvatars [34], we build a Point-based Learnable Representation Field (PLRF) on the original meshes in order to augment the initial Gaussian point field. As shown in Fig. 3, instead of initializing one Gaussian point at the center of each triangle face, we adopt a learnable strategy. Given the three vertices of a triangle $\{x_1, x_2, x_3\}$, we first compute their mean position $\bar{x}$ by:

$$\bar{x} = \mathrm{mean}(x_1, x_2, x_3) \tag{5}$$

From the centroid $\bar{x}$, we obtain the line segments to the three vertices of the triangle, $\{\bar{x}x_1, \bar{x}x_2, \bar{x}x_3\}$. We then allocate the Gaussian points on these segments by:

$$\begin{aligned}
x'_1 &= (1-n) * \bar{x} + n * x_1 \\
x'_2 &= (1-n) * \bar{x} + n * x_2 \\
x'_3 &= (1-n) * \bar{x} + n * x_3
\end{aligned} \tag{6}$$

where $n$ is a learned parameter in the range $[0,1]$, initialized at 0.5. Thus, the new positions of the Gaussian points are represented as $\{x'_1, x'_2, x'_3, \bar{x}\}$.

We assign four Gaussian points to each triangle face at initialization, which is equivalent to splitting a complete triangle face into four sub-faces. Additionally, we count separately the Gaussian points that are cloned during training. This approach, named PLRF, increases the density of the Gaussian points and avoids holes in the rendered avatars.
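The initialization of Eqs. 5 and 6 can be illustrated with a minimal NumPy sketch for a single triangle. The function name `plrf_points` is hypothetical, and `n` is a plain float here, whereas in the method it is a parameter optimized during training.

```python
import numpy as np

def plrf_points(vertices, n=0.5):
    """Place four Gaussian centers on one triangle.

    vertices: (3, 3) array whose rows are x1, x2, x3.
    Returns a (4, 3) array with rows x1', x2', x3', centroid.
    """
    centroid = vertices.mean(axis=0)              # Eq. 5
    inner = (1.0 - n) * centroid + n * vertices   # Eq. 6, broadcast over rows
    return np.vstack([inner, centroid])

# Usage: with n = 0.5 each point sits halfway between the centroid and a vertex.
tri = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 3.0, 0.0]])
points = plrf_points(tri)
```

With `n = 0` all four points collapse onto the centroid, and with `n = 1` the three movable points reach the vertices, so the range $[0,1]$ keeps every point inside the triangle.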

After constructing the point-based field, we apply a global transform that maps the local Gaussian points from the triangle face to the global space:

$$r^g = \mathbf{R} r \tag{7}$$

$$\mu^g = \frac{1}{2}\mathbf{k}\mathbf{R}\mu + \{x'_1, x'_2, x'_3, \bar{x}\} \tag{8}$$

$$s^g = \frac{1}{2}\mathbf{k} n s \tag{9}$$

where $\{r, \mu, s\}$ are the raw Gaussian attributes in local space, $\{r^g, \mu^g, s^g\}$ are the Gaussian attributes in global space, $\mathbf{R} = \{R, R, R, R\}$, and $\mathbf{k} = \{k, k, k, k\}$. We define $R$ as the global rotation, given by the orientation of the triangle in global space, and the triangle scaling $k$ as the mean of the length of one of the edges and the length of its perpendicular. Here, we repeat the original scale and rotation four times to match the four positions. Each triangle has its own scaling factor $k$ and global rotation $R$, which are fixed during training.
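A small NumPy sketch of Eqs. 7 to 9 for one Gaussian may help fix the roles of $R$ and $k$. Several details are assumptions rather than statements from the text: rotations are represented as 3×3 matrices instead of quaternions, the triangle orientation is built from one edge, the face normal and their cross product, and the names `triangle_frame` and `local_to_global` are hypothetical.

```python
import numpy as np

def triangle_frame(v):
    """Return (R, k) for a triangle with vertices v of shape (3, 3).

    R: orthonormal frame (edge direction, normal, their cross product) as columns.
    k: mean of one edge length and the length of its perpendicular (the height).
    """
    edge = v[1] - v[0]
    edge_len = np.linalg.norm(edge)
    normal = np.cross(edge, v[2] - v[0])
    e_hat = edge / edge_len
    n_hat = normal / np.linalg.norm(normal)
    R = np.stack([e_hat, n_hat, np.cross(e_hat, n_hat)], axis=1)
    height = np.linalg.norm(normal) / edge_len  # distance from v[2] to the edge
    return R, 0.5 * (edge_len + height)

def local_to_global(mu, rot, scale, anchor, R, k, n=0.5):
    """Map local Gaussian attributes to global space around one anchor point."""
    rot_g = R @ rot                        # Eq. 7
    mu_g = 0.5 * k * (R @ mu) + anchor     # Eq. 8
    scale_g = 0.5 * k * n * scale          # Eq. 9
    return mu_g, rot_g, scale_g

# Usage: a Gaussian with zero local offset lands exactly on its anchor.
tri = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
R, k = triangle_frame(tri)
mu_g, rot_g, scale_g = local_to_global(
    np.zeros(3), np.eye(3), np.ones(3), np.array([1.0, 1.0, 0.0]), R, k
)
```

In the full method the same `R` and `k` would be shared by all four anchor points of a triangle, which is what the repeated sets $\mathbf{R}$ and $\mathbf{k}$ express.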

Besides, we adopt the adaptive density control and growth. For each 3D Gaussian that exhibits a large view-space positional gradient, we either split it into two smaller Gaussian points if it is large, or clone it if it is small.
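The split-or-clone rule described above can be written as a short decision function. This is a sketch only: the function name and both threshold values are hypothetical placeholders, not settings reported in the paper.

```python
def densify_action(grad_norm, max_scale, grad_threshold=2e-4, scale_threshold=0.01):
    """Decide how to densify one Gaussian.

    grad_norm: magnitude of the view-space positional gradient.
    max_scale: largest axis scale of the Gaussian.
    Both thresholds are illustrative placeholders.
    """
    if grad_norm <= grad_threshold:
        return "keep"    # well fitted, leave untouched
    if max_scale > scale_threshold:
        return "split"   # large Gaussian: replace with two smaller ones
    return "clone"       # small Gaussian: duplicate it

# Usage
actions = [densify_action(g, s) for g, s in [(1e-5, 0.5), (1e-3, 0.5), (1e-3, 0.001)]]
```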

Because of the learnable sampling strategy, to prevent the scale of a Gaussian point within a triangle face from becoming too large, we introduce a scale regularization on the scale property, which we describe in Sec. 4.5.

4.3 Transform Network

At this point, we have the Gaussian attributes in global space, $r^g$, $\mu^g$, and $s^g$, which are attached to the mesh. However, there are still non-surface regions and subtle facial details that cannot be modeled by the FLAME framework. To fit the Gaussian deformation caused by expression motions, we employ a standard MLP network $F_\theta$ as the Gaussian deformation transform. Here, we define the Point-based Learnable Representation Field at timestep 0 as the initial canonical space. The MLP network takes as input the positions in the initial canonical space $\mu^g_0$ and the current FLAME properties $\rho_i$, namely the expression and the pose but not the shape, and outputs spatial residuals of the Gaussian attributes:

$$\delta\mu_i, \delta s_i, \delta r_i = F_\theta(\gamma(\mu^g_0), \rho_i) \tag{10}$$

where $\theta$ represents the optimized parameters of the MLP, and $\gamma$ represents the positional encoding of the spatial coordinates of the 3D Gaussians into a high-dimensional sequence, as described in NeRF [29]. Here, we choose 10 as the frequency of the positional encoding. Finally, the current spatial parameters that are fed into the final rendering step can be represented as:

$$\mu^g_i, s^g_i, r^g_i = \mu^g_0 \oplus \delta\mu_i,\; s^g_0 \oplus \delta s_i,\; r^g_0 \oplus \delta r_i \tag{11}$$
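To make the data flow of Eq. 10 concrete, here is a minimal NumPy sketch of a NeRF-style positional encoding with 10 frequencies followed by a small MLP forward pass. Everything about the network itself is an assumption for illustration: the hidden width, the ReLU activations, the length of the FLAME condition vector `rho`, and the 3+3+4 split of the output into position, scale and rotation residuals are hypothetical, and the sine and cosine terms are grouped per coordinate rather than interleaved.

```python
import numpy as np

def positional_encoding(p, num_freqs=10):
    """Encode points p of shape (N, 3) into shape (N, 3 * 2 * num_freqs)."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi       # 2^l * pi, l = 0..L-1
    angles = p[:, :, None] * freqs                       # (N, 3, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(p.shape[0], -1)

def transform_mlp(mu0, rho, weights):
    """Eq. 10 sketch: canonical positions + FLAME condition -> residuals."""
    cond = np.broadcast_to(rho, (mu0.shape[0], rho.size))
    h = np.concatenate([positional_encoding(mu0), cond], axis=1)
    for W, b in weights[:-1]:
        h = np.maximum(h @ W + b, 0.0)                   # hidden layers with ReLU
    W, b = weights[-1]
    out = h @ W + b                                      # (N, 10)
    return out[:, :3], out[:, 3:6], out[:, 6:10]         # d_mu, d_s, d_r

# Usage with random weights and a hypothetical 16-dimensional condition vector.
rng = np.random.default_rng(0)
mu0 = rng.normal(size=(5, 3))
rho = rng.normal(size=16)
dims = [60 + 16, 64, 10]
weights = [(rng.normal(size=(a, b)) * 0.01, np.zeros(b))
           for a, b in zip(dims[:-1], dims[1:])]
d_mu, d_s, d_r = transform_mlp(mu0, rho, weights)
```

The residuals would then be combined with the canonical attributes as in Eq. 11. How $\oplus$ acts on rotations is not spelled out here, so the sketch stops at the residuals.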

4.4 Geometry Enhancement

Hair strands and other non-facial structures such as eyeglasses and hairstyles have a profound impact on photo-realistic rendering. In this work, we incorporate an alpha map for each ground truth view and render an alpha map $\hat{A}$ as follows:

$$\hat{A}(x) = \sum_{i \in M} \alpha_i(x) \prod_{j=1}^{i-1} (1 - \alpha_j(x)) \tag{12}$$

We then encourage consistency between the rendered alpha map $\hat{A}$ and the pseudo alpha map $A$, which is quantified as follows:

$$\mathcal{L}_{\alpha} = \lambda_{\alpha} \|\hat{A} - A\|^2 \tag{13}$$

where $\lambda_{\alpha}$ is a hyperparameter equal to 0.5 and the pseudo alpha map $A$ is obtained from SGHM [7].
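Eqs. 12 and 13 can be sketched as follows. The per-Gaussian alpha maps are assumed to be given and sorted from front to back, the function names are hypothetical, and averaging the squared error over pixels is an assumption about how the norm is normalized.

```python
import numpy as np

def render_alpha(alphas):
    """Eq. 12: accumulate alpha over depth-sorted Gaussians.

    alphas: (K, H, W) array of per-Gaussian alpha maps, nearest first.
    """
    acc = np.zeros(alphas.shape[1:])
    transmittance = np.ones(alphas.shape[1:])
    for a in alphas:
        acc += a * transmittance
        transmittance *= 1.0 - a
    return acc

def alpha_loss(a_hat, a_ref, lam=0.5):
    """Eq. 13 with the squared error averaged over pixels (an assumption)."""
    return lam * np.mean((a_hat - a_ref) ** 2)

# Usage: two stacked Gaussians with alpha 0.5 each cover 75% of the pixel.
alphas = np.full((2, 1, 1), 0.5)
a_hat = render_alpha(alphas)
loss = alpha_loss(a_hat, np.ones((1, 1)))
```

The accumulated value equals one minus the product of all $(1 - \alpha_j)$, so it measures how much of the background is occluded at each pixel, which is why it can be compared directly against a matting result.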

4.5 Optimization Scheme

Given the rendered image C^^𝐶\hat{C}over^ start_ARG italic_C end_ARG via Eq. 3 and the ground truth image C𝐶Citalic_C, we utilize the L1 loss l1subscript𝑙1\mathcal{L}_{l1}caligraphic_L start_POSTSUBSCRIPT italic_l 1 end_POSTSUBSCRIPT and a D-SSIM term [56] to supervise the pixel-wise difference by:

color=(1λssim)subscript𝑐𝑜𝑙𝑜𝑟1subscript𝜆𝑠𝑠𝑖𝑚\displaystyle\mathcal{L}_{color}=(1-\lambda_{ssim})caligraphic_L start_POSTSUBSCRIPT italic_c italic_o italic_l italic_o italic_r end_POSTSUBSCRIPT = ( 1 - italic_λ start_POSTSUBSCRIPT italic_s italic_s italic_i italic_m end_POSTSUBSCRIPT ) l1+λssimDSSIMsubscript𝑙1subscript𝜆𝑠𝑠𝑖𝑚subscript𝐷𝑆𝑆𝐼𝑀\displaystyle\mathcal{L}_{l1}+\lambda_{ssim}\mathcal{L}_{D-SSIM}caligraphic_L start_POSTSUBSCRIPT italic_l 1 end_POSTSUBSCRIPT + italic_λ start_POSTSUBSCRIPT italic_s italic_s italic_i italic_m end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT italic_D - italic_S italic_S italic_I italic_M end_POSTSUBSCRIPT (14)
l1subscript𝑙1\displaystyle\mathcal{L}_{l1}caligraphic_L start_POSTSUBSCRIPT italic_l 1 end_POSTSUBSCRIPT =C^C2absentsuperscriptnorm^𝐶𝐶2\displaystyle=\|\hat{C}-C\|^{2}= ∥ over^ start_ARG italic_C end_ARG - italic_C ∥ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT

where λssimsubscript𝜆𝑠𝑠𝑖𝑚\lambda_{ssim}italic_λ start_POSTSUBSCRIPT italic_s italic_s italic_i italic_m end_POSTSUBSCRIPT is a hyperparameter equal to 0.4.

Besides, to preserve the structure of the ground truth image $C$ in the rendered image $\hat{C}$, we add a structure loss $\mathcal{L}_{st}$ between them, which maintains detail and increases contrast:

$$\mathcal{L}_{st}=\lambda_{st}\left((\nabla_{lr}(\hat{C})-\nabla_{lr}(C))^{2}+(\nabla_{rl}(\hat{C})-\nabla_{rl}(C))^{2}\right), \tag{15}$$

where $\nabla_{lr}$ denotes the image gradients computed from left to right, $\nabla_{rl}$ denotes the image gradients computed from right to left, and $\lambda_{st}$ is a hyperparameter equal to 0.3. We compute the gradients with a 2D convolution. Because we move from the multi-view setting to a single view, we introduce several regularization terms to ensure consistency between the monocular video and the FLAME mesh.
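For the structure loss of Eq. 15, a minimal plain-Python sketch is given below. The two 1x3 kernels, the zero padding, and the sum over pixels are assumptions, since the text only states that the gradients come from a 2D convolution. Each gradient of the rendered image is compared with the same-direction gradient of the ground truth.

```python
K_LR = [[0.0, -1.0, 1.0]]  # forward difference:  I[y][x+1] - I[y][x]
K_RL = [[1.0, -1.0, 0.0]]  # backward difference: I[y][x-1] - I[y][x]

def correlate2d_same(img, kernel):
    """Zero-padded 2D cross-correlation; output has the same size as img."""
    h, w = len(img), len(img[0])
    kh, kw = len(kernel), len(kernel[0])
    oy, ox = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    yy, xx = y + i - oy, x + j - ox
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * img[yy][xx]
            out[y][x] = acc
    return out

def structure_loss(pred, target, lam_st=0.3):
    """Sum of squared gradient differences in both horizontal directions."""
    total = 0.0
    for kernel in (K_LR, K_RL):
        g_pred = correlate2d_same(pred, kernel)
        g_tgt = correlate2d_same(target, kernel)
        for row_p, row_t in zip(g_pred, g_tgt):
            for gp, gt in zip(row_p, row_t):
                total += (gp - gt) ** 2
    return lam_st * total
```

Because the loss acts on differences between neighbouring pixels, it penalizes blurred edges more strongly than a plain per-pixel loss does, which is in line with the stated goal of keeping detail and contrast.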

Scale regularization on invisible points. Under the multi-view setting, most Gaussian points around the original FLAME mesh are visible. A monocular view, however, captures far fewer points than multi-view perspectives, and Gaussian points that are not visible from the monocular camera may degrade images rendered from novel viewpoints.

Thus, we introduce the invisible point scale regularization:

$$\mathcal{L}_{invis}=\|\mathcal{M}*s\| \tag{16}$$

$$\mathcal{M}=\begin{cases}1, & \text{if radii}<0\\ 0, & \text{otherwise}\end{cases} \tag{17}$$

where $\mathcal{M}$ represents the invisibility mask.
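Eqs. 16 and 17 can be sketched as follows. The per-point `radii` are assumed to be the screen-space radii reported by the rasterizer, the scales are assumed to be 3-vectors, and the norm is taken as the Euclidean norm over all masked entries; the sign convention (negative radius means invisible) follows Eq. 17 as printed.

```python
import math

def invisibility_mask(radii):
    """1.0 for points flagged invisible (radius < 0, Eq. 17), else 0.0."""
    return [1.0 if r < 0 else 0.0 for r in radii]

def invisible_scale_loss(scales, radii):
    """Euclidean norm of the scales of invisible points (Eq. 16).

    scales: list of per-point (sx, sy, sz) tuples.
    radii:  list of per-point radii, same length as scales.
    """
    mask = invisibility_mask(radii)
    squared = sum(m * c * c for m, s in zip(mask, scales) for c in s)
    return math.sqrt(squared)
```

Visible points contribute nothing to this term, so minimizing it only shrinks Gaussians that the training camera never sees.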

Scale threshold regularization. For the visible Gaussian points, an overly large scale of a Gaussian point within a triangle face results in jittering artifacts. To mitigate this, we add a threshold regularization on the visible Gaussian points:

$$\mathcal{L}_{scale}=\left\|\max((1-\mathcal{M})s,\xi_{scaling})\right\| \tag{18}$$

with $\xi_{scaling}=0.15$.
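A small sketch of Eq. 18 makes the effect of the threshold concrete. As before, the Euclidean norm, the 3-vector scales, and the use of rasterizer radii to decide visibility are assumptions. One consequence of the max is that scale entries below $\xi_{scaling}$ contribute a constant, so only entries above the threshold would receive a gradient.

```python
import math

def scale_threshold_loss(scales, radii, xi=0.15):
    """Euclidean norm of max((1 - M) * s, xi) over all scale entries (Eq. 18)."""
    squared = 0.0
    for s, r in zip(scales, radii):
        visible = 0.0 if r < 0 else 1.0  # (1 - M): keep visible points only
        for c in s:
            clamped = max(visible * c, xi)  # entries below xi are floored
            squared += clamped * clamped
    return math.sqrt(squared)
```

For example, a visible point with scale (0.1, 0.1, 0.1) and one with (0.05, 0.05, 0.05) yield the same loss, whereas raising any entry above 0.15 increases it.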

Our final loss function is thus:

$$\mathcal{L}=\mathcal{L}_{color}+\mathcal{L}_{\alpha}+\mathcal{L}_{st}+\lambda_{invis}\mathcal{L}_{invis}+\lambda_{scale}\mathcal{L}_{scale} \tag{19}$$

with $\lambda$ denoting the weight of each regularization term, set as follows: $\lambda_{invis}=0.3$, $\lambda_{scale}=0.15$.

5 Experiments

5.1 Experiment Setup

Figure 4: Capturing details of our dataset.

Dataset. We use 6 sets of data, mostly released by previous works [49, 12]. Each set consists mainly of one complete video at 25 FPS and 512×512 resolution. The processed videos are between 2 and 3 minutes long, roughly 3000-4000 frames. We use RemBg for foreground segmentation and SGHM [7] for the alpha maps. All of these subjects are open-source.

Besides, we capture our own dataset of 3 subjects with an iPhone 15 Pro Max to verify the robustness of our approach. The phone is mounted on a tripod to prevent shaking. Each person sits in front of the iPhone camera and is prompted to rotate their head and perform various facial expressions; the capturing details are shown in Fig. 4. All captured videos are cropped and down-sampled to 600×600 resolution, and each video contains approximately 1500-2000 frames. We use the preprocessing code to extract the FLAME parameters. Further details of the data preprocessing pipeline are available in the supplementary material.

Implementation Details. We implement our network with PyTorch [32] and use Adam [22] for parameter optimization. The learning rates of the Gaussian parameters are the same as in the original implementation, while the learning rate of the transform network is $1e-4$. We train our model for 120000 iterations. Every 3000 iterations, we reset the opacity property of the Gaussian points. We also perform a dynamic adaptive control strategy every 400 iterations until iteration 60000. We use a single Nvidia GeForce RTX 4090 GPU for all of our experiments.
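The schedule described above can be written down as a small helper. The numbers come from the text; the 1-based iteration counter, the action labels, and the assumption that opacity resets continue for the whole run are illustrative choices, not details confirmed by the paper.

```python
TOTAL_ITERS = 120_000
OPACITY_RESET_EVERY = 3_000
CONTROL_EVERY = 400
CONTROL_UNTIL = 60_000

def scheduled_actions(iteration):
    """Return the maintenance steps due at a given 1-based iteration."""
    actions = []
    # Adaptive control of the Gaussian points, only in the first half of training.
    if iteration % CONTROL_EVERY == 0 and iteration <= CONTROL_UNTIL:
        actions.append("adaptive_control")
    # Periodic opacity reset (assumed to run until the end of training).
    if iteration % OPACITY_RESET_EVERY == 0:
        actions.append("reset_opacity")
    return actions
```

Under these assumptions a full run performs 150 control steps and 40 opacity resets, and the two coincide every 6000 iterations during the first half of training.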

Baseline. We compare our method with three representative works: INSTA [60], an efficient implicit head representation that creates a surface-embedded dynamic neural radiance field based on neural graphics primitives; GaussianAvatars [34], which binds Gaussian points to the mesh to drive the Gaussian rendering; and FlashAvatar [49], which uses UV sampling for Gaussian initialization and an offset network to eliminate the FLAME offset. For a fair comparison, both GaussianAvatars and FlashAvatar employ the FLAME2020 model, and their training settings are similar to ours. All experiments are conducted using a single Nvidia GeForce RTX 4090 GPU.

Figure 5: Alpha rendering results of FAGhead.
Figure 6: Qualitative comparison of ID 1-6 (from top to bottom) on the open-source datasets. FAGhead shows improved performance over strong baselines in capturing fine details such as shoulder strands, teeth, necklaces, etc.
Figure 7: Qualitative comparison of subjects 1-3 (from top to bottom) on our captured datasets.
Figure 8: Novel view synthesis results of FAGhead. It produces high-fidelity geometry and appearance even from perspectives not encountered during training.
Table 1: Quantitative comparisons with state-of-the-art head avatar reconstruction methods on the open-source datasets.

| Method | ID1 PSNR↑ | ID1 SSIM↑ | ID1 LPIPS↓ | ID2 PSNR↑ | ID2 SSIM↑ | ID2 LPIPS↓ | ID3 PSNR↑ | ID3 SSIM↑ | ID3 LPIPS↓ | ID4 PSNR↑ | ID4 SSIM↑ | ID4 LPIPS↓ | ID5 PSNR↑ | ID5 SSIM↑ | ID5 LPIPS↓ | ID6 PSNR↑ | ID6 SSIM↑ | ID6 LPIPS↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| INSTA [60] | 17.28 | 0.852 | 0.207 | 15.95 | 0.831 | 0.221 | 14.63 | 0.846 | 0.177 | 19.84 | 0.921 | 0.112 | 13.83 | 0.806 | 0.251 | 12.56 | 0.796 | 0.272 |
| GaussianAvatars [34] | 25.12 | 0.906 | 0.156 | 19.75 | 0.903 | 0.150 | 17.77 | 0.907 | 0.138 | 23.56 | 0.952 | 0.050 | 19.02 | 0.857 | 0.222 | 17.63 | 0.782 | 0.272 |
| FlashAvatar [49] | 28.69 | 0.911 | 0.116 | 25.46 | 0.902 | 0.105 | 23.70 | 0.938 | 0.096 | 25.05 | 0.951 | 0.075 | 26.52 | 0.916 | 0.097 | 24.25 | 0.821 | 0.158 |
| Ours | 28.76 | 0.934 | 0.054 | 25.81 | 0.935 | 0.062 | 22.11 | 0.939 | 0.075 | 26.92 | 0.963 | 0.029 | 25.14 | 0.918 | 0.074 | 23.51 | 0.866 | 0.128 |
Table 2: Quantitative comparisons with state-of-the-art head avatar reconstruction methods on our captured datasets.

| Method | subject1 PSNR↑ | subject1 SSIM↑ | subject1 LPIPS↓ | subject2 PSNR↑ | subject2 SSIM↑ | subject2 LPIPS↓ | subject3 PSNR↑ | subject3 SSIM↑ | subject3 LPIPS↓ |
|---|---|---|---|---|---|---|---|---|---|
| INSTA [60] | 17.38 | 0.812 | 0.242 | 18.07 | 0.835 | 0.236 | 15.31 | 0.771 | 0.264 |
| GaussianAvatars [34] | 23.02 | 0.859 | 0.144 | 21.63 | 0.893 | 0.154 | 26.49 | 0.904 | 0.088 |
| FlashAvatar [49] | 26.68 | 0.860 | 0.079 | 27.23 | 0.907 | 0.076 | 28.48 | 0.900 | 0.074 |
| Ours | 27.83 | 0.900 | 0.066 | 28.31 | 0.929 | 0.064 | 30.61 | 0.941 | 0.041 |

5.2 Qualitative and Quantitative Comparison in Reconstruction

Fig. 5 visualizes the alpha rendering results of FAGhead. Fig. 6 and Fig. 7 depict the qualitative comparison between our model and the baseline methods above.

INSTA [60] generates renders that are reasonably consistent with the articulated facial expression and head pose. However, INSTA uses neural graphics primitives embedded in the FLAME surface, which limits its ability to accurately model accessories such as ties and necklaces (the 3rd and 5th rows). Besides, it tends to generate artifacts around the mouth (the 2nd and 5th rows in Fig. 6 and the 1st row in Fig. 7) and over-smoothed results that ignore thin structures, especially in the shoulder region.

GaussianAvatars [34] primarily relies on a parametric morphable face model to rig 3D Gaussian splats. It achieves a more precise geometric representation by attaching the 3D Gaussian splats to a FLAME mesh and transforming them from local to global coordinates, so its rendering performance hinges on the quality of the parametric morphable face model. Thus, it tends to generate artifacts around the eyes because of geometry errors (the 1st row in Fig. 6). Besides, its reliance on the parametric morphable face model usually leads to over-smoothed accessories (the 3rd and 5th rows in Fig. 6 and the 1st row in Fig. 7).

FlashAvatar [49] employs a uniform 3D Gaussian field embedded in the surface for initialization. However, because the number of 3D Gaussian splats is fixed and no adaptive density control strategy is used, artifacts show up as spikes at the edges of the avatars, such as the shoulders or hair (the 1st rows). Moreover, blur appears around the eye and mouth regions (the 2nd and 3rd rows in Fig. 6 and the 3rd row in Fig. 7). In contrast, our method produces photorealistic images that closely align with the ground truth, capturing nearly all fine facial details, thin structures, and accessories.

Tab. 1 and Tab. 2 report the quantitative comparison between our model and the other methods. The metrics are PSNR, SSIM, and LPIPS [56].
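Of the three metrics, PSNR has a simple closed form, sketched below for reference. The flat-list image layout and the [0, 1] value range are assumptions; SSIM and LPIPS are omitted because they require windowed statistics and a pretrained network, respectively.

```python
import math

def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat images."""
    mse = sum((a - b) ** 2 for a, b in zip(pred, target)) / len(pred)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A uniform error of 0.1 on a [0, 1] scale corresponds to roughly 20 dB, which gives a sense of scale for the values in the tables.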

Figure 9: Qualitative results of ours and three other methods on the facial reenactment task. Our method preserves personalized facial details in gaze direction and interior mouth regions, and synthesizes more natural results.
Figure 10: The effect of the Point-based Learnable Representation Field.
Figure 11: The effect of the transform network.
Figure 12: The effect of alpha rendering.

5.3 Novel View Synthesis

To demonstrate the novel view rendering ability of FAGhead, we freely rotate the camera pose to render results from novel views, as illustrated in Fig. 8. We first take a sample from the test set with its FLAME parameters and camera pose. Multiple viewpoints are then produced by multiplying the camera pose by a rotation of the desired angle, and the image for each view is generated by Gaussian rendering. The novel view results show no artifacts or unrealistic facial expressions, even at intermediate rotation angles.
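The pose manipulation can be illustrated with a short sketch. It assumes a 4x4 homogeneous camera-to-world matrix, a rotation about the vertical (y) axis, and left-multiplication, so that the camera orbits the world origin; the text does not specify the axis or the multiplication order, so these are illustrative choices.

```python
import math

def rotation_y(angle_deg):
    """4x4 homogeneous rotation about the y axis."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [-s, 0.0, c, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

def matmul4(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def novel_view_pose(camera_pose, angle_deg):
    """Rotate a camera pose about the world y axis by angle_deg degrees."""
    return matmul4(rotation_y(angle_deg), camera_pose)
```

Sweeping `angle_deg` over a range of values yields a set of poses that can each be passed to the renderer together with the fixed FLAME parameters of the chosen test sample.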

5.4 Cross-identity Reenactment

The cross-identity reenactment results are presented in Fig. 9. By replacing the input FLAME parameters of the transform network, our approach performs expression reenactment with finer-grained teeth and gaze direction.

Among the previous works, GaussianAvatars [34] shows artifacts at the shoulder edges and an unnatural gaze direction. Because it depends on the offset network, the results generated by FlashAvatar [49] appear somewhat over-smoothed. INSTA [60], despite its deformable neural radiance field, tends to lose geometric details. More reenactment results are available in the supplementary materials.

Table 3: We systematically ablated several key components and assessed quantitative performance to demonstrate their effectiveness.

| Metrics | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| w/o PLRF | 25.61 | 0.884 | 0.239 |
| w/o Transform Network | 25.21 | 0.902 | 0.104 |
| w/o Alpha rendering | 28.11 | 0.927 | 0.065 |
| w/o Structure loss $\mathcal{L}_{st}$ | 28.47 | 0.931 | 0.058 |
| Ours | 28.75 | 0.934 | 0.054 |

5.5 Ablation Study

To validate the effectiveness of our method components, we deactivate each of them and report results in Tab. 3.

Point-based Learnable Representation Field. Without the Point-based Learnable Representation Field, we randomly initialize 100000 Gaussian points to build the canonical space for Gaussian splatting. We then directly optimize the Gaussian point attributes and obtain the final results with the spatial residuals output by the transform network. As shown in the 1st row of Tab. 3 and in Fig. 10, the expression details are not well captured, demonstrating the role of the Point-based Learnable Representation Field in preserving fine details.

Transform Network. Without the transform network, we use only the raw FLAME parameters to fit the transformed mesh, relying on the linear transformations defined by the LBS formula of the original FLAME model [25]. The limited geometry information causes blurry renderings and a large loss of fidelity, as shown in the 2nd row of Tab. 3 and in Fig. 11.

Alpha rendering. Without alpha rendering, artifacts show up as spikes at the edges of the avatars, such as the shoulders or hair. Besides, an implausible hole appears between the neck and the collar, as shown in the 3rd row of Tab. 3 and in Fig. 12. Alpha rendering enforces geometric consistency; without it, the Gaussian rendering process relies solely on color information and disregards geometric attributes.

Structure loss $\mathcal{L}_{st}$. The structure loss $\mathcal{L}_{st}$ maintains detail and increases contrast. We show the results without the structure loss $\mathcal{L}_{st}$ in the 4th row of Tab. 3.

6 Conclusion and Discussion

FAGhead is a novel approach that achieves high-fidelity reconstruction and full animation of 3D human avatars. We propose the Point-based Learnable Representation Field as the prior for reconstructing portraits and exploit the transform network to fit the deformation, outperforming state-of-the-art methods. However, our method still has room for improvement. One limitation is that it does not model the oral cavity effectively. Moreover, our rendering performance depends heavily on the quality of data preprocessing, and our method struggles to handle significant errors at this stage. Addressing these issues will be a key focus of our future research.

References

  • [1] Athar, S., Xu, Z., Sunkavalli, K., Shechtman, E., Shu, Z.: Rignerf: Fully controllable neural 3d portraits. In: Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition. pp. 20364–20373 (2022)
  • [2] Blanz, V., Vetter, T.: A morphable model for the synthesis of 3d faces. In: Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pp. 157–164 (2023)
  • [3] Buehler, C., Bosse, M., McMillan, L., Gortler, S., Cohen, M.: Unstructured lumigraph rendering. In: Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pp. 497–504 (2023)
  • [4] Cao, C., Weng, Y., Zhou, S., Tong, Y., Zhou, K.: Facewarehouse: A 3d facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics 20(3), 413–425 (2013)
  • [5] Chaudhuri, S., Kalogerakis, E., Giguere, S., Funkhouser, T.: Attribit: content creation with semantic attributes. In: Proceedings of the 26th annual ACM symposium on User interface software and technology. pp. 193–202 (2013)
  • [6] Chen, A., Xu, Z., Geiger, A., Yu, J., Su, H.: Tensorf: Tensorial radiance fields. In: European Conference on Computer Vision. pp. 333–350. Springer (2022)
  • [7] Chen, X., Zhu, Y., Li, Y., Fu, B., Sun, L., Shan, Y., Liu, S.: Robust human matting via semantic guidance. In: Proceedings of the Asian Conference on Computer Vision (ACCV) (2022)
  • [8] Chen, Y., Wang, L., Li, Q., Xiao, H., Zhang, S., Yao, H., Liu, Y.: Monogaussianavatar: Monocular gaussian point-based head avatar. arXiv preprint arXiv:2312.04558 (2023)
  • [9] Daněček, R., Black, M.J., Bolkart, T.: Emoca: Emotion driven monocular face capture and animation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20311–20322 (2022)
  • [10] Feng, Y., Feng, H., Black, M.J., Bolkart, T.: Learning an animatable detailed 3d face model from in-the-wild images. ACM Transactions on Graphics (ToG) 40(4), 1–13 (2021)
  • [11] Fridovich-Keil, S., Yu, A., Tancik, M., Chen, Q., Recht, B., Kanazawa, A.: Plenoxels: Radiance fields without neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5501–5510 (2022)
  • [12] Gao, X., Zhong, C., Xiang, J., Hong, Y., Guo, Y., Zhang, J.: Reconstructing personalized semantic facial nerf models from monocular video. ACM Transactions on Graphics (TOG) 41(6), 1–12 (2022)
  • [13] Gecer, B., Ploumpis, S., Kotsia, I., Zafeiriou, S.: Fast-ganfit: Generative adversarial network for high fidelity 3d face reconstruction. arXiv preprint arXiv:2105.07474 (2021)
  • [14] Grassal, P.W., Prinzler, M., Leistner, T., Rother, C., Nießner, M., Thies, J.: Neural head avatars from monocular rgb videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 18653–18664 (2022)
  • [15] Han, X.F., Laga, H., Bennamoun, M.: Image-based 3d object reconstruction: State-of-the-art and trends in the deep learning era. IEEE transactions on pattern analysis and machine intelligence 43(5), 1578–1604 (2019)
  • [16] Hu, Y.L., Yin, B.C., Cheng, S.Q., Gu, C.L.: An improved morphable model for 3d face synthesis. In: Proceedings of 2004 International Conference on Machine Learning and Cybernetics (IEEE Cat. No. 04EX826). vol. 7, pp. 4362–4367. IEEE (2004)
  • [17] Jin, H., Liu, I., Xu, P., Zhang, X., Han, S., Bi, S., Zhou, X., Xu, Z., Su, H.: Tensoir: Tensorial inverse rendering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 165–174 (2023)
  • [18] Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196 (2017)
  • [19] Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4401–4410 (2019)
  • [20] Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics 42(4), 1–14 (2023)
  • [21] Kim, H., Garrido, P., Tewari, A., Xu, W., Thies, J., Niessner, M., Pérez, P., Richardt, C., Zollhöfer, M., Theobalt, C.: Deep video portraits. ACM transactions on graphics (TOG) 37(4), 1–14 (2018)
  • [22] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [23] Kirschstein, T., Qian, S., Giebenhain, S., Walter, T., Nießner, M.: Nersemble: Multi-view radiance field reconstruction of human heads. ACM Transactions on Graphics (TOG) 42(4), 1–14 (2023)
  • [24] Li, J., Zhang, J., Bai, X., Zhou, J., Gu, L.: Efficient region-aware neural radiance fields for high-fidelity talking portrait synthesis. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7568–7578 (2023)
  • [25] Li, T., Bolkart, T., Black, M.J., Li, H., Romero, J.: Learning a model of facial shape and expression from 4d scans. ACM Trans. Graph. 36(6), 194–1 (2017)
  • [26] Liu, L., Gu, J., Zaw Lin, K., Chua, T.S., Theobalt, C.: Neural sparse voxel fields. Advances in Neural Information Processing Systems 33, 15651–15663 (2020)
  • [27] Lombardi, S., Simon, T., Schwartz, G., Zollhoefer, M., Sheikh, Y., Saragih, J.: Mixture of volumetric primitives for efficient neural rendering. ACM Transactions on Graphics (ToG) 40(4), 1–13 (2021)
  • [28] Luo, H., Nagano, K., Kung, H.W., Xu, Q., Wang, Z., Wei, L., Hu, L., Li, H.: Normalized avatar synthesis using stylegan and perceptual refinement. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11662–11672 (2021)
  • [29] Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM 65(1), 99–106 (2021)
  • [30] Park, K., Sinha, U., Barron, J.T., Bouaziz, S., Goldman, D.B., Seitz, S.M., Martin-Brualla, R.: Nerfies: Deformable neural radiance fields. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5865–5874 (2021)
  • [31] Park, K., Sinha, U., Hedman, P., Barron, J.T., Bouaziz, S., Goldman, D.B., Martin-Brualla, R., Seitz, S.M.: Hypernerf: A higher-dimensional representation for topologically varying neural radiance fields. arXiv preprint arXiv:2106.13228 (2021)
  • [32] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019)
  • [33] Paysan, P., Knothe, R., Amberg, B., Romdhani, S., Vetter, T.: A 3d face model for pose and illumination invariant face recognition. In: 2009 sixth IEEE international conference on advanced video and signal based surveillance. pp. 296–301. Ieee (2009)
  • [34] Qian, S., Kirschstein, T., Schoneveld, L., Davoli, D., Giebenhain, S., Nießner, M.: Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians. arXiv preprint arXiv:2312.02069 (2023)
  • [35] Ravi, N., Reizenstein, J., Novotny, D., Gordon, T., Lo, W.Y., Johnson, J., Gkioxari, G.: Accelerating 3d deep learning with pytorch3d. arXiv:2007.08501 (2020)
  • [36] Rivero, A., Athar, S., Shu, Z., Samaras, D.: Rig3dgs: Creating controllable portraits from casual monocular videos. arXiv preprint arXiv:2402.03723 (2024)
  • [37] Romdhani, S., Pierrard, J.S., Vetter, T.: 3d morphable face model, a unified approach for analysis and synthesis of images. Face Processing: Advanced Modeling and Methods p. 768 (2005)
  • [38] Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4104–4113 (2016)
  • [39] Shum, H., Kang, S.B.: Review of image-based rendering techniques. In: Visual Communications and Image Processing 2000. vol. 4067, pp. 2–13. SPIE (2000)
  • [40] Snavely, N., Seitz, S.M., Szeliski, R.: Photo tourism: exploring photo collections in 3d. In: ACM siggraph 2006 papers, pp. 835–846 (2006)
  • [41] Sun, C., Sun, M., Chen, H.T.: Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5459–5469 (2022)
  • [42] Tewari, A., Thies, J., Mildenhall, B., Srinivasan, P., Tretschk, E., Yifan, W., Lassner, C., Sitzmann, V., Martin-Brualla, R., Lombardi, S., et al.: Advances in neural rendering. In: Computer Graphics Forum. vol. 41, pp. 703–735. Wiley Online Library (2022)
  • [43] Thies, J., Elgharib, M., Tewari, A., Theobalt, C., Nießner, M.: Neural voice puppetry: Audio-driven facial reenactment. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16. pp. 716–731. Springer (2020)
  • [44] Tomasi, C., Kanade, T.: Shape and motion from image streams under orthography: a factorization method. International journal of computer vision 9, 137–154 (1992)
  • [45] Tran, L., Liu, F., Liu, X.: Towards high-fidelity nonlinear 3d face morphable model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1126–1135 (2019)
  • [46] Tran, L., Liu, X.: Nonlinear 3d face morphable model. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7346–7355 (2018)
  • [47] Wang, J., Xie, J.C., Li, X., Xu, F., Pun, C.M., Gao, H.: Gaussianhead: Impressive head avatars with learnable gaussian diffusion. arXiv preprint arXiv:2312.01632 (2023)
  • [48] Wang, M., Lyu, X.Q., Li, Y.J., Zhang, F.L.: Vr content creation and exploration with deep learning: A survey. Computational Visual Media 6, 3–28 (2020)
  • [49] Xiang, J., Gao, X., Guo, Y., Zhang, J.: Flashavatar: High-fidelity digital avatar rendering at 300fps. arXiv preprint arXiv:2312.02214 (2023)
  • [50] Xu, Q., Xu, Z., Philip, J., Bi, S., Shu, Z., Sunkavalli, K., Neumann, U.: Point-nerf: Point-based neural radiance fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5438–5448 (2022)
  • [51] Xu, Y., Chen, B., Li, Z., Zhang, H., Wang, L., Zheng, Z., Liu, Y.: Gaussian head avatar: Ultra high-fidelity head avatar via dynamic gaussians. arXiv preprint arXiv:2312.03029 (2023)
  • [52] Xu, Y., Wang, L., Zhao, X., Zhang, H., Liu, Y.: Avatarmav: Fast 3d head avatar reconstruction using motion-aware neural voxels. In: ACM SIGGRAPH 2023 Conference Proceedings. pp. 1–10 (2023)
  • [53] Yu, A., Li, R., Tancik, M., Li, H., Ng, R., Kanazawa, A.: Plenoctrees for real-time rendering of neural radiance fields. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5752–5761 (2021)
  • [54] Zackariasson, P., Wilson, T.L.: The video game industry: Formation, present state, and future. Routledge (2012)
  • [55] Zhang, C., Chen, T.: A survey on image-based rendering—representation, sampling and compression. Signal Processing: Image Communication 19(1), 1–28 (2004)
  • [56] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 586–595 (2018)
  • [57] Zheng, M., Yang, H., Huang, D., Chen, L.: Imface: A nonlinear 3d morphable face model with implicit neural representations. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20343–20352 (2022)
  • [58] Zheng, Y., Yifan, W., Wetzstein, G., Black, M.J., Hilliges, O.: Pointavatar: Deformable point-based head avatars from videos. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 21057–21067 (2023)
  • [59] Zielonka, W., Bolkart, T., Thies, J.: Towards metrical reconstruction of human faces. In: European Conference on Computer Vision. pp. 250–269. Springer (2022)
  • [60] Zielonka, W., Bolkart, T., Thies, J.: Instant volumetric head avatars. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4574–4584 (2023)
  • [61] Zwicker, M., Pfister, H., Van Baar, J., Gross, M.: Ewa volume splatting. In: Proceedings Visualization, 2001. VIS’01. pp. 29–538. IEEE (2001)

Supplementary Material

7 Implementation Details

7.1 Network Architecture

In Fig. 13, we show the structure of the transform network, which mainly consists of a set of fully connected (FC) layers with the Tanh activation function.

Figure 13: Network architecture of the transform network.

7.2 Data Preprocessing Pipeline

The data preprocessing pipeline is illustrated in Fig. 14. After obtaining the original capture videos, we first crop the head region. The background is then removed using a segmentation algorithm, and the shape vectors of the mesh are extracted via MICA. Finally, the FLAME parameters are extracted using our modified tracker based on the Metrical Photometric Tracker.

Here, we conduct an ablation study, shown in Fig. 15, in which we use the original tracker to extract the FLAME parameters. This leads to misalignment between the initialized mesh and the ground truth image, resulting in disorganized results.

Figure 14: Data preprocessing pipeline.
Figure 15: The effect of aligning the initialized mesh with the ground truth image.

8 Additional Results

8.1 Additional Reenactment Results

In Fig. 16, we present additional reenactment results on our captured datasets. Our approach demonstrates superior performance compared to the previous works.

Figure 16: More qualitative results of ours and two other methods on the facial reenactment task.