GaussianTalker: Real-Time High-Fidelity Talking Head Synthesis with Audio-Driven 3D Gaussian Splatting

Kyusun Cho1, Joungbin Lee1, Heeji Yoon1, Yeobin Hong1, Jaehoon Ko1,
Sangjun Ahn2, and Seungryong Kim1
1 Korea University       2 NCSOFT
https://ku-cvlab.github.io/GaussianTalker/
Abstract.

We propose GaussianTalker, a novel framework for real-time generation of pose-controllable talking heads. It leverages the fast rendering capabilities of 3D Gaussian Splatting (3DGS) while addressing the challenges of directly controlling 3DGS with speech audio. GaussianTalker constructs a canonical 3DGS representation of the head and deforms it in sync with the audio. A key insight is to encode the 3D Gaussian attributes into a shared implicit feature representation, where it is merged with audio features to manipulate each Gaussian attribute. This design exploits the spatial-aware features and enforces interactions between neighboring points. The feature embeddings are then fed to a spatial-audio attention module, which predicts frame-wise offsets for the attributes of each Gaussian. It is more stable than previous concatenation or multiplication approaches for manipulating the numerous Gaussians and their intricate parameters. Experimental results showcase GaussianTalker’s superiority in facial fidelity, lip synchronization accuracy, and rendering speed compared to previous methods. Specifically, GaussianTalker achieves a remarkable rendering speed of 120 FPS, surpassing previous benchmarks.

Talking Head Generation, 3D Controllable Head, 3D Gaussian Splatting
ccs: Computing methodologies Reconstruction
ccs: Information systems Multimedia content creation
ccs: Computing methodologies 3D imaging
Figure 1. Fidelity and inference time comparison between existing 3D talking face synthesis models (Guo et al., 2021; Tang et al., 2022; Li et al., 2023) and ours. Our method, GaussianTalker, achieves results on par with or better than these models at much higher FPS. Note that we also include a more efficient and faster variant of GaussianTalker. The size of each bubble represents the inference time of each method.

1. Introduction

Generating a talking head video driven by arbitrary speech audio is a popular task that has various uses, including the generation of digital humans, virtual avatars, movie production, and teleconferencing (Wiles et al., 2018; Chen et al., 2019; Jamaludin et al., 2019; Prajwal et al., 2020; Zhang et al., 2023; Suwajanakorn et al., 2017; Thies et al., 2020; Song et al., 2022). While various works (Wiles et al., 2018; Chen et al., 2019; Jamaludin et al., 2019; Prajwal et al., 2020) have successfully attempted to solve this task using generative models, they do not focus on controlling head poses, limiting their realism and applicability. Recently, numerous studies (Guo et al., 2021; Liu et al., 2022; Ye et al., 2022, 2023; Tang et al., 2022; Li et al., 2023) have applied neural radiance fields (NeRF) (Mildenhall et al., 2020) for the creation of pose-controllable talking portraits. By directly conditioning audio features in the multi-layer perceptron (MLP) of NeRF, these methods can synthesize a view-consistent 3D head structure with its lips synced to the input audio. Although these NeRF-based techniques achieve high-quality and consistent visual outputs, their slow inference speed limits their practicality. Despite recent advancements (Tang et al., 2022; Li et al., 2023) achieving rendering speeds of up to 30 frames per second (fps) at $512\times 512$ resolution, computational bottlenecks must still be overcome before these methods can be applied in real-world scenarios.

Addressing this limitation, an intuitive solution is to leverage the fast rendering capabilities of 3D Gaussian Splatting (3DGS) (Kerbl et al., 2023). Recently recognized as a viable alternative to NeRF, 3DGS offers comparable rendering quality while significantly improving inference speeds. Although 3DGS was initially proposed for reconstructing static 3D scenes, subsequent works have extended it to dynamic scenes (Wu et al., 2023; Yang et al., 2023b; Luiten et al., 2023; Yang et al., 2023a). However, there has been little research on leveraging 3DGS to create dynamic 3D scenes with controllable inputs, and most of it has focused on using an intermediate mesh representation to drive the 3D Gaussians (Chen et al., 2023; Qian et al., 2023; Hu and Liu, 2024; Liu et al., 2023; Li et al., 2024). Deformation that relies on an intermediate 3D mesh representation, such as FLAME (Li et al., 2017), often fails to capture fine details in hair and facial wrinkles.

We identify two major challenges in directly mapping the speech audio to the deformation of 3D Gaussians. First, the 3DGS representation lacks shared spatial information among adjacent points, complicating its manipulation. The optimization process of 3DGS does not consider relationships between neighboring Gaussians, which are crucial for maintaining facial region cohesion during deformation. Second, the extensive parameter space and the substantial number of Gaussians pose a challenge to their manipulation. Unlike controllable NeRF representations, where the position and the number of sampling points are fixed, the position, shape, and appearance attributes of numerous Gaussian points need to be deformed per frame while also preserving the intricate facial details.

In this paper, we present GaussianTalker, a novel framework for real-time pose-controllable talking head synthesis. For the first time, we leverage the 3D Gaussian representation to exploit its fast scene modeling capability for audio-driven dynamic facial animation. We construct a static 3DGS representation of the canonical head shape and deform this in sync with the audio. Specifically, we employ a multi-resolution triplane to extract feature embeddings for each 3D Gaussian position, from which each Gaussian attribute is directly estimated. This design ensures that the triplane learns the spatial and semantic information of the 3D head, while the interpolation mechanism of the 2D feature grids efficiently enforces interactions between neighboring points. The feature embeddings are subsequently fed to the proposed spatial-audio attention module, where they are merged with the audio features to predict the frame-wise offsets for the attributes of each Gaussian. This module successfully models the relevance between audio features and the motions for each Gaussian primitive. The cross attention offers a more stable approach of manipulating the substantial number of Gaussians and their intricate parameter space, compared to concatenation (Guo et al., 2021; Tang et al., 2022) or multiplication (Li et al., 2023) as in previous works. Qualitative and quantitative experiments demonstrate GaussianTalker’s superiority in facial fidelity, lip synchronization accuracy, and rendering speed compared to previous methods. Additionally, we conduct ablation studies to verify the effectiveness of individual design choices within our model.

Our main contributions are summarized as follows:

  • For the first time, we present a novel audio-conditioned 3D Gaussian Splatting framework for real-time 3D-aware talking head synthesis.

  • We reformulate the 3D Gaussian representation with a feature volume representation in order to enforce spatial consistency among adjacent Gaussians.

  • We integrate cross-attention mechanisms between audio and spatial features to improve stability and ensure region-specific deformation across a significant number of Gaussians.

2. Related Work

2.1. Audio-driven talking portrait synthesis

Audio-driven talking portrait synthesis aims to create realistic facial animations with accurate lip movements based on audio input. Early 2D GAN-based methods (Prajwal et al., 2020; Yu et al., 2020; Yin et al., 2022; Zhou et al., 2020; Sun et al., 2021) achieved photorealism but lacked control over head pose due to the absence of 3D geometry. In order to control the head poses, some works (Thies et al., 2020; Wang et al., 2020; Lu et al., 2021; Zhang et al., 2023) utilize model-based methods, where facial landmarks and 3D morphable models reinforce the lip sync model with the ability to adjust the orientation of the head. However, these approaches lead to new problems such as extra errors from the intermediate representations, and inaccuracies in identity preservation and realism.

Recently, Neural Radiance Fields (NeRF) (Mildenhall et al., 2020) have been explored for talking portraits due to their ability to capture complex scenes. AD-NeRF (Guo et al., 2021) pioneered using NeRF’s implicit representation for conditional audio input, but separate networks for head and torso limited its flexibility. Subsequent NeRF-based methods (Yao et al., 2022; Shen et al., 2022; Liu et al., 2022) achieved high quality but suffered from slow rendering speeds. While RAD-NeRF (Tang et al., 2022) and ER-NeRF (Li et al., 2023) improved efficiency and quality with grid-based NeRF (Müller et al., 2022), real-time rendering of pose-controllable 3D talking head remains challenging.

2.2. 3D Gaussian splatting

3DGS (Kerbl et al., 2023) is a pioneering technique in point cloud rendering that utilizes a multitude of ellipsoidal, anisotropic balls to precisely represent a scene. Each point embodies a 3D Gaussian distribution, with its mean, covariance, opacity, and spherical harmonics parameters optimized to accurately capture the scene’s shapes and appearances. This approach effectively resolves common issues in point rendering, such as output gaps. Furthermore, combined with a tile-based rasterization algorithm, it facilitates expedited training and real-time rendering capabilities. Recently, 3DGS has gained widespread application in 3D vision tasks such as object manipulation (Fang et al., 2023; Gao et al., 2024), reconstruction (Kerbl et al., 2023; Fang et al., 2022), and perception (Cen et al., 2023; Luiten et al., 2023) within 3D environments.

Figure 2. Overview of our GaussianTalker framework. GaussianTalker utilizes a multi-resolution triplane to leverage different scales of features depicting a canonical 3D head. These features are fed into a spatial-audio attention module along with the audio feature to predict per-frame deformations, enabling fast and reliable talking head synthesis.

2.3. Facial animation with 3DGS

Previous methods for facial reconstruction and animation primarily relied on 3D Morphable Models (3DMM) (Grassal et al., 2022; Khakhulin et al., 2022) or utilized neural implicit representations (Zheng et al., 2022; Athar et al., 2022; Gao et al., 2022). Recent approaches (Qian et al., 2023; Wang et al., 2024; Chen et al., 2023; Dhamo et al., 2023) have shifted towards adopting the 3DGS representation, aiming to leverage the benefits of rapid training and rendering while still achieving competitive levels of photorealism. GaussianAvatars (Qian et al., 2023) reconstructed head avatars by rigging 3D Gaussians on a FLAME (Li et al., 2017) mesh. MonoGaussianAvatar (Chen et al., 2023) learned explicit head avatars by shifting the mean position of 3D Gaussians from canonical to deformed space using Linear Blend Skinning (LBS) while simultaneously adjusting other Gaussian parameters through a deformation field. GaussianHead (Wang et al., 2024) adopted a motion deformation field to adapt to facial movements while preserving head geometry, and separately utilized a tri-plane to retain the appearance information of individual 3D Gaussians. However, the aforementioned methods tend to depend on parametric models for facial animation. In contrast to previous works, our audio-driven method not only requires no data beyond the speech sequence for facial reenactment but is also readily applicable to novel audio.

3. Preliminary: 3D Gaussian Splatting

3D Gaussian splatting (3DGS) (Kerbl et al., 2023) employs anisotropic 3D Gaussians as geometric primitives for learning an explicit 3D representation. Each 3D Gaussian is defined by a center mean $\mu\in\mathbb{R}^{3}$ and a covariance matrix $\Sigma\in\mathbb{R}^{3\times 3}$ in 3D coordinates as follows:

$$g(x)=\exp\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right), \tag{1}$$

for a 3D coordinate $x\in\mathbb{R}^{3}$. The covariance matrix $\Sigma$ is further decomposed into $\Sigma=RSS^{T}R^{T}$ with a scaling matrix $S$ and a rotation matrix $R$, defined by a scaling factor $s\in\mathbb{R}^{3}$ and a learnable quaternion $r\in\mathbb{R}^{4}$, respectively. Additionally, to encode the appearance information, each 3D Gaussian contains a set of spherical harmonics with degree $k$ such that $SH\in\mathbb{R}^{3(k+1)(k+1)}$, along with an opacity value $\alpha\in\mathbb{R}$. In summary, 3DGS represents a 3D scene with a set of 3D Gaussian parameters, defined as:

$$\mathcal{G}=\{\mu,r,s,SH,\alpha\}. \tag{2}$$

Given a novel viewing direction $\pi$, a 2D image $\hat{I}$ is rendered as:

$$\hat{I}=\mathcal{R}(\mathcal{G};\pi), \tag{3}$$

where $\mathcal{R}(\cdot)$ is the differentiable rasterizer.
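As a concrete illustration of Eq. (1) and the decomposition $\Sigma=RSS^{T}R^{T}$, the following minimal NumPy sketch builds a covariance matrix from a quaternion and a scaling factor and evaluates the Gaussian at a point. The function names and the $(w, x, y, z)$ quaternion ordering are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def quat_to_rotmat(r):
    """Convert a quaternion (w, x, y, z) into a 3x3 rotation matrix."""
    r = np.asarray(r, dtype=float)
    w, x, y, z = r / np.linalg.norm(r)  # normalize so that R is a valid rotation
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def covariance(r, s):
    """Sigma = R S S^T R^T with S = diag(s)."""
    R = quat_to_rotmat(r)
    S = np.diag(np.asarray(s, dtype=float))
    return R @ S @ S.T @ R.T

def gaussian_value(x, mu, sigma):
    """Unnormalized Gaussian g(x) of Eq. (1)."""
    d = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return float(np.exp(-0.5 * d @ np.linalg.inv(sigma) @ d))

# Identity rotation: the covariance is diagonal with entries s**2.
sigma = covariance(r=[1.0, 0.0, 0.0, 0.0], s=[0.1, 0.2, 0.3])
```

Parameterizing $\Sigma$ through $r$ and $s$ keeps it symmetric and positive semi-definite by construction, which would not be guaranteed if its nine entries were optimized directly.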

More specifically, for $\mathcal{R}(\cdot)$, 3DGS employs differentiable splatting (Yifan et al., 2019) during novel view rendering. In order to project 3D Gaussians to 2D for rendering, the covariance matrix in the 2D space, $\Sigma^{\prime}\in\mathbb{R}^{2\times 2}$, is calculated from the viewing transform $W$ and the Jacobian $J$ of the affine approximation of the projective transformation (Zwicker et al., 2001), as follows:

$$\Sigma^{\prime}=JW\Sigma W^{T}J^{T}. \tag{4}$$
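Eq. (4) can be sketched in a few lines of NumPy. The example below assumes a pinhole camera with focal lengths `fx` and `fy`, takes `W` to be the 3x3 rotational part of the viewing transform, and uses `t` for the Gaussian mean in camera coordinates. These names and the camera model are illustrative assumptions rather than details stated in the text.

```python
import numpy as np

def project_covariance(sigma, W, t, fx, fy):
    """Sigma' = J W Sigma W^T J^T for a Gaussian whose camera-space mean is t."""
    tx, ty, tz = t
    # Jacobian of the perspective map (u, v) = (fx * x / z, fy * y / z) at t.
    J = np.array([
        [fx / tz, 0.0, -fx * tx / tz**2],
        [0.0, fy / tz, -fy * ty / tz**2],
    ])
    return J @ W @ sigma @ W.T @ J.T

# A Gaussian on the optical axis, two units in front of the camera.
cov2d = project_covariance(np.diag([0.01, 0.04, 0.09]), np.eye(3),
                           (0.0, 0.0, 2.0), fx=100.0, fy=100.0)
```

Because $J$ is a local linearization of the projection, the resulting 2D footprint is exact only near the Gaussian center, which is the approximation the text attributes to Zwicker et al. (2001).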

Subsequently, the color of each pixel is computed by blending all Gaussians that overlap the pixel, ordered by their depths, as follows:

$$C=\sum_{i=1}c_{i}\alpha^{\prime}_{i}\prod_{j=1}^{i-1}(1-\alpha^{\prime}_{j}), \tag{5}$$

where $c_{i}$ is the color of each point determined using the SH coefficients and the view direction, and $\alpha^{\prime}_{i}$ is computed by multiplying the opacity $\alpha$ of the 3D Gaussian by the value of its projected 2D Gaussian, with covariance $\Sigma^{\prime}$, at the pixel.
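The compositing in Eq. (5) can be illustrated for a single pixel with plain Python. This is a simplified sketch: the dictionary layout of each splat is hypothetical, the colors are assumed to be already evaluated from the SH coefficients, and the tile-based sorting used by the actual rasterizer is replaced by a per-pixel sort.

```python
import math

def gaussian_2d(pixel, mean, cov):
    """Unnormalized 2D Gaussian with covariance cov = [[a, b], [b, c]]."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    dx, dy = pixel[0] - mean[0], pixel[1] - mean[1]
    # Quadratic form d^T cov^{-1} d using the closed-form 2x2 inverse.
    q = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * q)

def blend_pixel(pixel, splats):
    """Front-to-back compositing of Eq. (5).

    Each splat is a dict with keys: depth, mean, cov, opacity, color.
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # running product of (1 - alpha'_j)
    for g in sorted(splats, key=lambda g: g["depth"]):
        alpha = g["opacity"] * gaussian_2d(pixel, g["mean"], g["cov"])
        for k in range(3):
            color[k] += g["color"][k] * alpha * transmittance
        transmittance *= 1.0 - alpha
    return color
```

For example, a red splat in front of a blue one, both centered on the pixel with opacity 0.5, contributes 0.5 of red and 0.5 × 0.5 = 0.25 of blue, since the nearer splat absorbs half of the light before the farther one is reached.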

4. Methodology

4.1. Problem formulation and Overview

In this section, we describe the main components of GaussianTalker, designed for the real-time synthesis of high-fidelity, pose-controllable talking head images driven by audio input. Our model is trained on a talking portrait video $\mathcal{V}=\{I_{n}\}$ consisting of $N$ image frames for an identity. Our objective is to reconstruct a set of canonical 3D Gaussians that represent the mean shape of the talking head, and to learn a deformation module that deforms the 3D Gaussians according to the corresponding input audio. During inference, for the input audio $a_{n}$, the deformation module predicts the offsets of each Gaussian attribute, and the deformed Gaussians are rasterized at the viewing point $\pi_{n}$ to output the novel image $\hat{I}_{n}$.

An overview of our proposed method is depicted in Fig. 2. In Sec. 4.2, we first introduce the multi-resolution tri-plane that encodes the low-dimensional features of the 3D Gaussians to represent the static mean shape of the canonical head. In Sec. 4.3, we introduce the spatial-audio cross-attention module that fuses 3D Gaussian features and audio features to accurately model facial motion driven by input audio. Finally, Sec. 4.4 describes the stage-wise training strategy and the loss functions used.

4.2. Learning canonical 3D Gaussians with triplane representation

In this section, we introduce the details of learning the canonical shape of the talking head with the 3D Gaussian representation. The vanilla implementation of 3DGS (Kerbl et al., 2023) does not inherently capture the spatial relationships between neighboring and distant 3D Gaussians. However, an ideal feature representation for a dynamic 3D head should be similar for proximal facial regions and distinct for separated ones, as nearby facial primitives are likely to move in the same direction.

To realize this, we modify the 3D Gaussian representation by learning a low-dimensional feature representation, which can later be merged with the audio features for per-Gaussian deformation. We formulate the embedding space to encode information about the attributes of the 3D Gaussians, in order to take into account the shape and appearance of each Gaussian when predicting its deformation offsets. More specifically, we adopt a hybrid 3D representation that utilizes the explicit 3D representation of 3DGS, while also taking advantage of the encoded spatial information of implicit neural radiance fields (Mildenhall et al., 2020). For each of the canonical 3D positions $\mu_{c}$, we extract feature embeddings $f(\mu_{c})$ from a multi-resolution triplane representation (Chan et al., 2022; Fridovich-Keil et al., 2023; Cao and Johnson, 2023). These feature embeddings are utilized to calculate the scale $s_{c}$, rotation $r_{c}$, spherical harmonics $SH_{c}$, and opacity $\alpha_{c}$ of each point. These computed attributes make up the canonical 3D Gaussians of the talking head, denoted as:

$$\mathcal{G}_{\mathrm{can}}=\{\mu_{c},r_{c},s_{c},SH_{c},\alpha_{c}\}. \tag{6}$$

During training, instead of directly updating the 3D Gaussian attributes, the feature grids of the triplane and the attribute prediction networks are optimized. This allows the feature embedding $f(\mu_{c})$ to store the region-specific facial information of the canonical 3D head, while also enforcing spatial relationships between neighboring Gaussians. In the following, we introduce the formulation of each module in detail.

4.2.1. Triplane representation for 3D Gaussian

[Figure 3 panels, left to right: rendered image, $P^{\mathrm{xy}}$, $P^{\mathrm{yz}}$, $P^{\mathrm{zx}}$]
Figure 3. Visualization of the triplane feature grids. The sequence displays a rendered image, followed by its orthographically projected embeddings: frontal (xy), overhead (yz), and side (zx) views. The embeddings are visualized by reducing their dimension to 3 using Principal Component Analysis.

In order to encode the spatial information of the canonical 3D head, we adopt a multi-resolution triplane representation, constructed from three orthogonal 2D feature grids, $P=\{P^{\mathrm{xy}},P^{\mathrm{yz}},P^{\mathrm{zx}}\}$. Each of these planes has shape $H\times R\times R$, where $H$ stands for the hidden dimension of features, and $R$ denotes the resolution of each dimension. For an individual 3D Gaussian with position $\mu$, each of its coordinate values is normalized to $[0,R)$, and its corresponding features are computed by interpolating the point into a regularly spaced 2D grid for each plane. These features are combined using the Hadamard product $\prod$ over the planes, followed by concatenation $\bigcup$ along the different dimensions, to produce a final feature vector $f(\mu)$ of length $H$ for each canonical Gaussian position $\mu_{c}$, as follows:

$$f(\mu)=\bigcup\prod_{p\in P}\mathrm{interp}\big(p,\zeta_{p}(\mu_{c})\big), \tag{7}$$

where $\zeta_{p}(\mu)$ denotes a projection of $\mu$ onto the $p$-th plane and $\mathrm{interp}$ denotes bilinear interpolation of a point into the regularly spaced 2D grid. The visualization of features in our multi-resolution triplane is depicted in Fig. 3.
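The lookup in Eq. (7) can be sketched as follows. This NumPy example is an illustration under several assumptions: the concatenation $\bigcup$ is taken to run over resolution levels, so the output length is the sum of the per-level channel counts; coordinates are normalized with hypothetical scene bounds `bounds`; and grid coordinates are scaled to $[0, R-1]$ to keep the indexing in range. The function names are not from the paper.

```python
import numpy as np

def bilinear(plane, u, v):
    """Bilinearly interpolate an (H, R, R) feature grid at continuous (u, v)."""
    _, R, _ = plane.shape
    u = float(np.clip(u, 0.0, R - 1.0))
    v = float(np.clip(v, 0.0, R - 1.0))
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, R - 1), min(v0 + 1, R - 1)
    du, dv = u - u0, v - v0
    return ((1 - du) * (1 - dv) * plane[:, u0, v0]
            + du * (1 - dv) * plane[:, u1, v0]
            + (1 - du) * dv * plane[:, u0, v1]
            + du * dv * plane[:, u1, v1])

def triplane_feature(mu, levels, bounds):
    """Hadamard product over the three planes, concatenation over levels.

    levels: list of dicts {"xy": ..., "yz": ..., "zx": ...} of (H, R, R) arrays.
    bounds: (lo, hi) scalars used to normalize the coordinates of mu.
    """
    lo, hi = bounds
    unit = (np.asarray(mu, dtype=float) - lo) / (hi - lo)  # each coordinate in [0, 1]
    feats = []
    for planes in levels:
        R = planes["xy"].shape[1]
        x, y, z = unit * (R - 1)  # continuous grid coordinates
        feats.append(bilinear(planes["xy"], x, y)
                     * bilinear(planes["yz"], y, z)
                     * bilinear(planes["zx"], z, x))
    return np.concatenate(feats)
```

Because two nearby positions fall between the same grid vertices, their interpolated features are blends of the same entries, which is how the grid couples neighboring Gaussians.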

4.2.2. Attribute prediction of canonical 3D Gaussians

Unlike the original 3DGS implementation shown in Eq. (2), we do not explicitly store the shape information $r$ and $s$, or the appearance information $SH$ and $\alpha$. Instead, these attributes are obtained from the corresponding feature representation $f(\mu)$. Specifically, we employ a set of MLP layers, denoted as $\mathcal{F}_{\mathrm{can}}(\cdot)$, to map the feature $f(\mu)$ to the mean scale $s_{c}$, mean rotation $r_{c}$, mean spherical harmonics $SH_{c}$, and mean opacity value $\alpha_{c}$, as follows:

$$\{s_{c},r_{c},SH_{c},\alpha_{c}\}=\mathcal{F}_{\mathrm{can}}\big(f(\mu)\big). \tag{8}$$

Compared to the original 3DGS (Kerbl et al., 2023) where each Gaussian is optimized independently, our hybrid representation conditioned on an implicit feature volume enforces shared facial information between adjacent points.
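A minimal sketch of the mapping in Eq. (8) is given below. It assumes one small MLP head per attribute with a single hidden layer and ReLU activation; the text only states that "a set of MLP layers" is used, so the head layout, layer widths, and random initialization here are illustrative choices, and the raw outputs are returned without any attribute-specific activation.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random weights for a small fully connected network."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def run_mlp(x, layers):
    """Apply the layers with ReLU in between (no activation on the output)."""
    for idx, (W, b) in enumerate(layers):
        x = x @ W + b
        if idx < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

def predict_canonical(features, heads):
    """Map features f(mu) of shape (M, D) to per-Gaussian attributes."""
    return {name: run_mlp(features, layers) for name, layers in heads.items()}

rng = np.random.default_rng(0)
D, k = 32, 1  # feature length and SH degree, chosen only for illustration
out_dims = {"s": 3, "r": 4, "sh": 3 * (k + 1) ** 2, "alpha": 1}
heads = {name: init_mlp(rng, [D, 64, dim]) for name, dim in out_dims.items()}
attrs = predict_canonical(rng.normal(size=(5, D)), heads)
```

Since every Gaussian passes through the same heads, two Gaussians with similar features $f(\mu)$ receive similar attributes, in contrast to the independent per-Gaussian parameters of the original 3DGS.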

4.3. Learning audio-driven deformation of 3D Gaussians

Previous works (Guo et al., 2021; Liu et al., 2022; Tang et al., 2022; Li et al., 2023) employ a conditional NeRF representation, wherein the 3D coordinates of the sampling points along each ray remain fixed, with only color and density conditioned on the input audio. However, in order to fully benefit from the explicit representation of 3DGS, we choose to deform the 3D Gaussians, manipulating not only the appearance information but also the spatial position and shape of each Gaussian primitive. While this can more accurately capture the constantly fluctuating 3D shape of the talking head, deformation of 3D Gaussians is a much more complex task than controlling a NeRF representation. The intricate nature of Gaussian primitives, coupled with their sheer quantity, presents significant challenges for deformation due to the extensive parameter space of 3D Gaussians. In addition, input audio does not impact the whole facial image uniformly, making it vital for the deformation module to understand how different facial regions respond to audio conditions for authentic facial animation.

In order to model the relations between the dynamic features and the vast number of 3D Gaussians, we fuse the input speech audio $a_{n}$ with the encoded feature $f(\mu_{c})$ in an attention mechanism to produce the audio-aware feature $h_{n}$ for the $n$-th image frame. The deformation offsets of each Gaussian attribute for subsequent frames are directly conditioned on the feature $h_{n}$. Finally, the deformed set of 3D Gaussians for the $n$-th image frame is defined as:

$$\mathcal{G}_{\mathrm{deform},n}=\{\mu_{c}+\Delta\mu_{n},\,r_{c}+\Delta r_{n},\,s_{c}+\Delta s_{n},\,SH_{c}+\Delta SH_{n},\,\alpha_{c}+\Delta\alpha_{n}\}, \tag{9}$$

where $\Delta\mu_{n},\Delta s_{n},\Delta r_{n},\Delta SH_{n},\Delta\alpha_{n}$ are the deformation offsets at the $n$-th frame for 3D position, scale, rotation, spherical harmonics parameters, and opacity, respectively. The details of each module are introduced in the following paragraphs.
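Eq. (9) amounts to an attribute-wise addition, which the following short sketch makes explicit. The dictionary keys and array shapes (for $M$ Gaussians) are hypothetical, and the offsets are assumed to be given. No re-normalization of the rotation quaternion is applied, since the text does not mention one.

```python
import numpy as np

ATTRS = ("mu", "r", "s", "sh", "alpha")

def deform(canonical, offsets):
    """Add the per-frame offsets to every canonical attribute, as in Eq. (9)."""
    return {name: canonical[name] + offsets[name] for name in ATTRS}

M = 3  # number of Gaussians in this toy example
canonical = {
    "mu": np.zeros((M, 3)),
    "r": np.tile([1.0, 0.0, 0.0, 0.0], (M, 1)),
    "s": np.ones((M, 3)),
    "sh": np.zeros((M, 12)),
    "alpha": np.full((M, 1), 0.5),
}
offsets = {name: np.full_like(value, 0.1) for name, value in canonical.items()}
deformed = deform(canonical, offsets)
```

The canonical set is left untouched, so the same canonical Gaussians can be reused for every frame and only the offsets change with the audio.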

4.3.1. Spatial-audio cross-attention

Previous approaches to implement region-aware audio, like ER-NeRF (Li et al., 2023), simply adjust the weights for the audio features at each 3D point through elementwise multiplication. However, it encounters a challenge in that, regardless of the diverse audio inputs in a dynamic scene, a particular static 3D point consistently maintains the same audio weight. This fails to acknowledge that a fixed 3D coordinate may not consistently correspond to the same facial region as the scene progresses. To address this issue and enhance the extraction of spatial-audio features, we introduce spatial-audio cross-attention module, a cross-attention mechanism that merges spatial feature embedding f(μc)𝑓subscript𝜇𝑐f(\mu_{c})italic_f ( italic_μ start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ) of the canonical 3D Gaussians with subsequent audio features, capturing how the input speech audio affects the movement of the 3D Gaussians. The spatial-audio cross-attention module comprises L𝐿Litalic_L sets of cross-attention layer 𝒯CA()subscript𝒯𝐶𝐴\mathcal{T}_{CA}(\cdot)caligraphic_T start_POSTSUBSCRIPT italic_C italic_A end_POSTSUBSCRIPT ( ⋅ ) and feed-forward layer FFN()𝐹𝐹𝑁FFN(\cdot)italic_F italic_F italic_N ( ⋅ ), each interconnected with skip connections. The module is formulated as:

(10) $z_n^0 = f(\mu_c),$
(11) ${z'_n}^l = \mathcal{T}_{CA}(z_n^{l-1}, a_n) + z_n^{l-1}, \quad l = 1 \ldots L,$
(12) $z_n^l = FFN({z'_n}^l) + {z'_n}^l, \quad l = 1 \ldots L,$

where the cross-attention is computed between the spatial feature $f$ and the audio feature $a_n$ of the $n$-th image frame. As a result, the output feature $z_n^L$ fuses the audio features with the rich facial details captured by each 3D Gaussian. This cross-attention module combines features in a more nuanced and stable way than simple concatenation or multiplication, because it re-forms the spatially aware facial features with respect to the subsequent audio features, taking into account the dynamic variability inherent in each 3D Gaussian.
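The recurrence in Eqs. (10)-(12) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: it assumes single-head attention without layer normalization, treats the audio feature $a_n$ as a short sequence of `T` tokens, and uses random weights. The names `cross_attention`, `ffn`, and `spatial_audio_attention` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # stabilise the exponentials
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(z, cond, Wq, Wk, Wv):
    """Single-head cross-attention: Gaussian features (queries) attend to condition tokens."""
    q, k, v = z @ Wq, cond @ Wk, cond @ Wv                     # (N, d), (T, d), (T, d)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, T), rows sum to 1
    return weights @ v, weights

def ffn(z, W1, W2):
    return np.maximum(z @ W1, 0.0) @ W2       # two-layer MLP with ReLU

def spatial_audio_attention(f_mu, cond, layers):
    z = f_mu                                              # Eq. (10)
    for p in layers:
        attn, _ = cross_attention(z, cond, p["Wq"], p["Wk"], p["Wv"])
        z_prime = attn + z                                # Eq. (11), skip connection
        z = ffn(z_prime, p["W1"], p["W2"]) + z_prime      # Eq. (12), skip connection
    return z

# Toy sizes (assumptions): N Gaussians, T audio tokens, width d, L layers.
rng = np.random.default_rng(0)
N, T, d, L = 5, 4, 8, 2
shapes = [("Wq", (d, d)), ("Wk", (d, d)), ("Wv", (d, d)), ("W1", (d, 4 * d)), ("W2", (4 * d, d))]
layers = [{name: rng.normal(scale=0.1, size=s) for name, s in shapes} for _ in range(L)]
f_mu = rng.normal(size=(N, d))    # stand-in for the triplane features f(mu_c)
a_n = rng.normal(size=(T, d))     # stand-in for the audio feature of frame n
z_L = spatial_audio_attention(f_mu, a_n, layers)
print(z_L.shape)
```

Because every Gaussian issues its own query, the attention weights differ per Gaussian and per frame, in contrast to a fixed per-point multiplicative weight.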

4.3.2. Disentanglement of speech-related motion.

When synthesizing a talking head, the corresponding speech audio does not account for all the intricate and diverse facial movements. Subtle expressions like eye blinks and facial wrinkles, along with external factors such as hair movement and variations in lighting, do not directly correlate with the input speech audio. Therefore, it is crucial to separate the non-verbal motions and scene variations when mapping speech audio to the 3D Gaussian deformation. In this section, we address this challenge by introducing additional input conditions that capture non-verbal motions, allowing us to disentangle speech-related motion from the monocular video.

Following previous works (Tang et al., 2022; Li et al., 2023), we first apply explicit eye blinking control with the eye feature $e$. Specifically, we employ AU45 from the Facial Action Coding System (Ekman and Friesen, 1978) to describe the degree of the eye blink, and use a sinusoidal positional encoding to match the input dimensions. Additionally, we integrate the camera viewpoint as an auxiliary input to disentangle non-verbal scene variations. Although we formulate the frame-wise camera $\pi_n$ as a facial viewpoint, a typical video is recorded with a static camera while the head moves continuously. Consequently, variations in the portrait image, such as hair displacement and lighting changes, occur independently of the speech audio. Hence, we employ a facial viewpoint embedding $\upsilon$ as an additional input condition to disentangle these non-auditory scene fluctuations. $\upsilon_n$ is an embedding vector obtained by passing the extrinsic camera pose $\pi_n$ through a small MLP so that it has the same dimensionality as the other inputs. Finally, we found that using a single null-vector ($\emptyset$) for all frames promotes consistency as a global feature across video frames. We incorporate this null-vector as an additional input to our cross-attention network. Thus, we reformulate Eq. (11) as:

(13) ${z'_n}^l = \mathcal{T}_{CA}(z_n^{l-1}, \{a_n, e_n, \upsilon_n, \emptyset\}) + z_n^{l-1}, \quad l = 1 \ldots L.$

In Fig. 4, we visualize the attention scores for each input in order to demonstrate the efficacy of disentangling audio-related motion. Further details on the network structure and visualization procedure are provided in the supplementary file.
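To make the condition set of Eq. (13) concrete, the sketch below builds the four tokens and computes per-Gaussian attention weights over them, which is the kind of quantity visualized per modality. It is a simplified illustration under several assumptions: the frequency schedule of the sinusoidal encoding, the MLP sizes, the 12-dimensional flattened pose, the AU45 value, and the zero-valued null-vector are all placeholders, and key/query projections are omitted. The helper names are hypothetical.

```python
import numpy as np

def sinusoidal_encoding(x, dim):
    """Map a scalar (e.g. AU45 intensity) to a dim-vector of sines and cosines."""
    freqs = 2.0 ** np.arange(dim // 2)        # 1, 2, 4, ... (assumed schedule)
    angles = np.pi * x * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def small_mlp(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2       # two-layer MLP with ReLU

def attention_weights(z, tokens):
    """Softmax over condition tokens for every Gaussian feature in z."""
    logits = z @ tokens.T / np.sqrt(z.shape[-1])
    logits = logits - logits.max(axis=-1, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d = 8                                          # shared token width (assumption)
a_n = rng.normal(size=d)                       # audio feature (placeholder values)
e_n = sinusoidal_encoding(0.7, d)              # eye feature from a made-up AU45 value
pose_n = rng.normal(size=12)                   # flattened 3x4 extrinsics (placeholder)
v_n = small_mlp(pose_n, rng.normal(size=(12, 16)), rng.normal(size=(16, d)))
null = np.zeros(d)                             # one vector shared by all frames
cond = np.stack([a_n, e_n, v_n, null])         # (4, d): {a_n, e_n, v_n, null}

z = rng.normal(size=(5, d))                    # features of 5 Gaussians
w = attention_weights(z, cond)                 # (5, 4): one weight per modality
print(w.round(3))
```

Each row of `w` sums to one, so a Gaussian that attends strongly to the eye or viewpoint token necessarily attends less to the audio token.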

[Figure 4 image grid. Columns, left to right: rendered, audio, eye blink, viewpoint, null. One row per subject.]
Figure 4. Illustration of attention score distributions across different modalities for two individuals. From left to right: the original rendered image, attention scores responsible for audio cues, eye blink dynamics, head orientation (facial viewpoint), and temporal consistency (null), respectively.

4.3.3. Audio-conditioned deformation of 3D Gaussian

The final deformation network takes the spatially aware audio features encoded for each 3D Gaussian and computes the deformation of its attributes, including position, rotation, and scale. We define a set of MLP regressors $\mathcal{F}_{\mathrm{deform}}(\cdot)$ to predict the offsets of each Gaussian attribute:

(14) $\{\Delta\mu_n, \Delta s_n, \Delta r_n, \Delta SH_n, \Delta\alpha_n\} = \mathcal{F}_{\mathrm{deform}}(z_n^L).$
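As a rough illustration of Eq. (14), the sketch below uses one small MLP head per attribute and adds the predicted offsets to the canonical attributes. Several details are assumptions made for the example rather than facts from the text: the additive update rule, the hidden width, and the restriction of SH to three DC coefficients. `HEAD_DIMS`, `init_heads`, and `deform` are hypothetical names.

```python
import numpy as np

# Output width per attribute head; SH is cut down to 3 DC colour coefficients for brevity.
HEAD_DIMS = {"mu": 3, "s": 3, "r": 4, "SH": 3, "alpha": 1}

def mlp(z, W1, W2):
    return np.maximum(z @ W1, 0.0) @ W2       # two-layer MLP with ReLU

def init_heads(d, hidden, rng):
    """One (W1, W2) pair per attribute, with small random weights."""
    return {k: (rng.normal(scale=0.01, size=(d, hidden)),
                rng.normal(scale=0.01, size=(hidden, out)))
            for k, out in HEAD_DIMS.items()}

def deform(canonical, z_L, heads):
    """Predict per-Gaussian offsets from z_L and add them to the canonical attributes."""
    return {k: canonical[k] + mlp(z_L, *heads[k]) for k in HEAD_DIMS}

rng = np.random.default_rng(0)
N, d = 6, 8                                    # toy sizes (assumptions)
canonical = {k: rng.normal(size=(N, out)) for k, out in HEAD_DIMS.items()}
heads = init_heads(d, 16, rng)
z_L = rng.normal(size=(N, d))                  # stand-in for the attention output
deformed = deform(canonical, z_L, heads)
print({k: v.shape for k, v in deformed.items()})
```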
Table 1. Quantitative comparison under the self-driven setting. The top, second-best, and third-best results are shown in red, orange, and yellow, respectively.
| Methods | PSNR ↑ | SSIM ↑ | LPIPS ↓ | FID ↓ | LMD ↓ | AUE ↓ | Sync ↑ | CSIM ↑ | Training Time ↓ | FPS ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| Ground Truth | N/A | 1 | 0 | 0 | 0 | 0 | 8.653 | 1 | N/A | N/A |
| Wav2Lip (Prajwal et al., 2020) | 30.461 | 0.911 | 0.024 | 33.074 | 4.458 | 1.761 | 9.606 | 0.887 | - | 19 |
| PC-AVS (Zhou et al., 2021) | 21.958 | 0.699 | 0.053 | 42.646 | 4.619 | 1.875 | 9.185 | 0.519 | - | 32 |
| AD-NeRF (Guo et al., 2021) | 30.341 | 0.906 | 0.026 | 20.243 | 5.692 | 2.331 | 4.939 | 0.908 | 13h | 0.13 |
| RAD-NeRF (Tang et al., 2022) | 30.703 | 0.915 | 0.026 | 26.238 | 3.142 | 2.196 | 5.757 | 0.911 | 3h | 32 |
| ER-NeRF (Li et al., 2023) | 31.673 | 0.919 | 0.014 | 19.829 | 3.003 | 1.974 | 5.976 | 0.922 | 1h | 34 |
| GaussianTalker* | 32.269 | 0.930 | 0.016 | 10.771 | 2.711 | 1.758 | 6.443 | 0.933 | 1h | 121 |
| GaussianTalker | 32.423 | 0.931 | 0.018 | 11.951 | 2.928 | 2.292 | 6.554 | 0.932 | 1.5h | 98 |

4.4. Training

4.4.1. Stage-wise optimization

3DGS (Kerbl et al., 2023) showed that the quality of reconstruction is influenced by the initialization of 3D Gaussians. Similarly, the training of the deformation field should also be conducted using a proper initialization of the canonical facial shape. To this end, we employ a two-stage training approach.

In the first stage, the canonical stage, we reconstruct the mean shape of the talking face by optimizing the positions of the 3D Gaussians and the multi-resolution triplane. Instead of the conventional initialization from structure-from-motion (SfM) points, we use the 3D coordinates of the mesh vertices obtained by fitting a 3D morphable model (3DMM). Note that the 3DMM fitting of each frame involves no extra preprocessing, as it is already a necessary part of obtaining the camera parameters of the talking face and is widely adopted in NeRF-based talking face synthesis works (Guo et al., 2021; Tang et al., 2022; Li et al., 2023). The static image of the canonical talking head is rasterized via:

(15) $\hat{I}_{\mathrm{can}} = R(\mathcal{G}_{\mathrm{can}}; \pi_n).$

This is followed by the deformation stage, where we optimize the whole network and thereby learn the cross-attention deformation network. For each frame, the dynamic talking head image is rendered as:

(16) $\hat{I}_n = R(\mathcal{G}_{\mathrm{deform},n}; \pi_n).$

4.4.2. Loss Functions

For the canonical stage, which reconstructs a static shape of the talking head, we follow the original 3DGS implementation (Kerbl et al., 2023) and combine an L1 color loss $\mathcal{L}_{\mathrm{L1}}$ with a D-SSIM term $\mathcal{L}_{\mathrm{D-SSIM}}$. Following previous audio-driven NeRF works (Guo et al., 2021; Tang et al., 2022; Li et al., 2023), we also use an LPIPS (Zhang et al., 2018) loss $\mathcal{L}_{\mathrm{lpips}}$ to capture sharp details. For a given input frame $I$, the overall loss of the canonical stage is $\mathcal{L}_{\mathrm{can}} = \mathcal{L}_{\mathrm{L1}} + \lambda_{\mathrm{lpips}}\mathcal{L}_{\mathrm{lpips}} + \lambda_{\mathrm{D-SSIM}}\mathcal{L}_{\mathrm{D-SSIM}}$. During the deformation stage, we employ an additional loss on the lip area of the talking head. Specifically, we apply a reconstruction loss to the image patch obtained by cropping the lip region located with facial landmarks (Bulat and Tzimiropoulos, 2017). Thus, the total loss for the deformation stage is $\mathcal{L}_{\mathrm{deform}} = \mathcal{L}_{\mathrm{can}} + \lambda_{\mathrm{lip}}\mathcal{L}_{\mathrm{lip}}$. Note that the deformed 3D Gaussians are splatted directly onto the combined background and torso image in order to render the head together with the background and torso, a common technique that prevents noise around the facial contours (Tang et al., 2022; Li et al., 2023). A more detailed explanation of this technique can be found in the supplementary file.
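The way the two stage losses compose can be sketched as follows. This is a hedged illustration only: the LPIPS and D-SSIM terms are passed in as callables and stubbed out with zero-returning placeholders, the lip reconstruction loss is assumed to be an L1 loss, and all weights and the lip bounding box are made-up values. The function names are hypothetical.

```python
import numpy as np

def l1_loss(pred, target):
    return float(np.mean(np.abs(pred - target)))

def crop(img, box):
    """Crop an (H, W, C) image to box = (top, left, height, width)."""
    top, left, h, w = box
    return img[top:top + h, left:left + w]

def canonical_loss(pred, target, lpips_fn, dssim_fn, lam_lpips, lam_dssim):
    # L_can = L_L1 + lam_lpips * L_lpips + lam_dssim * L_D-SSIM
    return (l1_loss(pred, target)
            + lam_lpips * lpips_fn(pred, target)
            + lam_dssim * dssim_fn(pred, target))

def deform_loss(pred, target, lip_box, lpips_fn, dssim_fn, lam_lpips, lam_dssim, lam_lip):
    # L_deform = L_can + lam_lip * L_lip, with L_lip assumed to be L1 on the lip patch
    lip = l1_loss(crop(pred, lip_box), crop(target, lip_box))
    return canonical_loss(pred, target, lpips_fn, dssim_fn, lam_lpips, lam_dssim) + lam_lip * lip

# Placeholder perceptual terms; a real setup would plug in LPIPS and D-SSIM implementations.
zero_term = lambda pred, target: 0.0

target = np.zeros((8, 8, 3))
pred = target.copy()
pred[4:6, 2:6] += 0.5                          # error confined to the "lip" patch
lip_box = (4, 2, 2, 4)                         # hypothetical landmark-derived box
l_can = canonical_loss(pred, target, zero_term, zero_term, 0.5, 0.2)
l_def = deform_loss(pred, target, lip_box, zero_term, zero_term, 0.5, 0.2, 1.0)
print(l_can, l_def)
```

In this toy case the error is diluted over the full frame in the canonical loss, while the lip term sees it at full strength, which is the motivation for weighting the mouth region separately.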

5. Experiments

Table 2. Quantitative comparison under the cross-driven setting. We extract two audio clips from SynObama demo (Suwajanakorn et al., 2017) to drive each method and compare lip synchronization.
| Methods | Testset A Sync ↑ | Testset A LMD ↓ | Testset A AUE ↓ | Testset B Sync ↑ | Testset B LMD ↓ | Testset B AUE ↓ |
|---|---|---|---|---|---|---|
| Ground Truth | 7.850 | 0 | 0 | 6.976 | 0 | 0 |
| Wav2Lip (Prajwal et al., 2020) | 8.272 | 7.102 | 2.023 | 7.907 | 5.591 | 3.164 |
| PC-AVS (Zhou et al., 2021) | 8.408 | 7.731 | 2.212 | 7.592 | 6.230 | 3.123 |
| AD-NeRF (Guo et al., 2021) | 5.128 | 18.986 | 3.654 | 5.109 | 9.221 | 3.266 |
| RAD-NeRF (Tang et al., 2022) | 5.126 | 12.485 | 3.611 | 4.497 | 7.760 | 3.447 |
| ER-NeRF (Li et al., 2023) | 4.694 | 12.477 | 3.779 | 4.822 | 7.698 | 3.287 |
| GaussianTalker | 5.356 | 12.702 | 3.663 | 5.413 | 7.812 | 3.265 |

5.1. Experimental Settings

5.1.1. Dataset and pre-processing.

For each target subject, we require several minutes of talking portrait video with a corresponding audio track for training. Specifically, the datasets are obtained from publicly released video datasets used in previous NeRF-based works (Guo et al., 2021; Liu et al., 2022; Shen et al., 2022; Ye et al., 2022), averaging 6,000 frames per video at 25 fps. We also perform experiments on selected video clips from the HDTF dataset (Zhang et al., 2021). Each portrait video is cropped and resized to $512 \times 512$, apart from the Obama video, which has a resolution of $450 \times 450$. We split each video into train and test sets at a ratio of 10:1, following the pre-processing steps introduced in AD-NeRF (Guo et al., 2021).
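For a sense of scale, the numbers above can be worked through for an average video. The snippet is a small sketch that assumes a chronological split (earlier frames for training, the tail for testing); the exact split procedure is defined by the AD-NeRF pre-processing and is not restated here. `split_frames` is a hypothetical helper.

```python
def split_frames(num_frames, train_parts=10, test_parts=1):
    """Split frame indices at a train:test ratio, keeping temporal order (assumed)."""
    n_train = num_frames * train_parts // (train_parts + test_parts)
    return range(0, n_train), range(n_train, num_frames)

num_frames, fps = 6000, 25                     # average video described above
train_idx, test_idx = split_frames(num_frames)
duration_s = num_frames / fps
print(len(train_idx), len(test_idx), duration_s)   # prints: 5454 546 240.0
```

An average video therefore lasts about four minutes, of which roughly 22 seconds (546 frames) are held out for testing under this assumed split.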

5.1.2. Comparison baselines.

We evaluate our proposed GaussianTalker framework against recent NeRF-based approaches that tackle the same task. We introduce two variants of our method: the full model, GaussianTalker, with $L=2$ cross-attention layers, and a lightweight version, GaussianTalker*, with $L=1$ layer. We use three models as baselines: AD-NeRF (Guo et al., 2021), RAD-NeRF (Tang et al., 2022), and ER-NeRF (Li et al., 2023). For a fair comparison, we implement each method using the torso from the ground-truth frames. Additionally, we include a comparison with one-shot 2D talking head models, namely Wav2Lip (Prajwal et al., 2020) and PC-AVS (Zhou et al., 2021), to cover a wider range of approaches.

[Figure 5 image grid. Columns (spoken word): country, of, crime, we, up, especially, like, ⟨mute⟩. Rows: Ground Truth, Wav2Lip, PC-AVS, AD-NeRF, RAD-NeRF, ER-NeRF, Ours.]
Figure 5. Comparative visualization of lip synchronization across different audio-visual models. The sequence depicts the lip shape conforming to specific phonemes in the spoken words ’country’, ’of’, ’crime’, ’we’, ’up’, ’especially’, ’like’, with the last frame showing a closed mouth (’mute’).

5.2. Quantitative Evaluation

5.2.1. Comparison settings and metrics.

Following previous works (Tang et al., 2022; Li et al., 2023), our comparisons are structured into two distinct settings: self-driven and cross-driven. In the self-driven setting, we evaluate the accuracy of head reconstruction for a particular identity using the test subset. We employ several reconstruction metrics, including peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and learned perceptual image patch similarity (LPIPS). Notably, these metrics are measured exclusively on the facial region. We also measure the realism of the reconstructed face using Fréchet Inception Distance (FID) (Heusel et al., 2017) and the identity preservation of the animated video using Cosine Similarity of Identity Embedding (CSIM) (Huang et al., 2020).

For the cross-driven setting, all methods are driven by entirely unrelated audio tracks to evaluate lip synchronization. The audio clips used in this setup were extracted from demos of SynObama (Suwajanakorn et al., 2017). Due to the absence of ground-truth images, we assess lip sync accuracy with landmark distance (LMD) and SyncNet confidence score (Sync). We also employ action units error (AUE) to measure the precision of facial movements. Finally, we compare the training time and frames-per-second (FPS) as measures to evaluate the efficiency of each method.
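Three of the metrics above have simple closed forms, sketched here in NumPy. This is a minimal sketch under stated assumptions: images are floats in [0, 1], the facial region is given as a boolean mask, landmarks are 2D pixel coordinates, and identity embeddings are plain vectors (in practice they would come from a face recognition network). The function names are hypothetical and the inputs are toy values.

```python
import numpy as np

def masked_psnr(pred, target, mask, max_val=1.0):
    """PSNR over the pixels where mask is True (e.g. the facial region)."""
    mse = np.mean((pred[mask] - target[mask]) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(max_val ** 2 / mse))

def landmark_distance(lm_pred, lm_ref):
    """LMD: mean Euclidean distance between corresponding 2D landmarks."""
    return float(np.mean(np.linalg.norm(lm_pred - lm_ref, axis=-1)))

def cosine_similarity(u, v):
    """CSIM: cosine similarity between two identity embeddings."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

target = np.zeros((4, 4, 3))
pred = target + 0.1                            # uniform error of 0.1 everywhere
face_mask = np.zeros((4, 4), dtype=bool)
face_mask[1:3, 1:3] = True                     # toy facial region
lm_ref = np.array([[10.0, 10.0], [20.0, 12.0]])
lm_pred = lm_ref + np.array([3.0, 4.0])        # every landmark shifted by (3, 4)

print(masked_psnr(pred, target, face_mask))    # about 20 dB for an error of 0.1
print(landmark_distance(lm_pred, lm_ref))      # 5.0
print(cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0])))
```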

5.2.2. Self-driven evaluation.

The self-driven evaluation results are presented in Tab. 1. Note that the PSNR, SSIM, and LPIPS scores of Wav2Lip (Prajwal et al., 2020) are not valid, as it takes ground-truth images as input. While the one-shot 2D-based methods, Wav2Lip and PC-AVS, achieve high synchronization scores, they fall short in faithful reconstruction, as reflected in their poor PSNR and LPIPS scores. Benefiting from the 3DGS representation, GaussianTalker achieves comparable image fidelity at significantly faster rendering speeds (over 120 fps for GaussianTalker*). Our method also achieves the best scores on most metrics and a higher Sync score than the NeRF-based baselines. These results show that our method can synthesize 3D heads with accurate lip synchronization at real-time rendering speeds.
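The rendering speeds can be put in perspective by converting the FPS values reported in Tab. 1 into per-frame times and comparing them with the 40 ms budget of 25 fps playback. The snippet below is simple arithmetic on those reported numbers; the threshold used for "real-time" (keeping up with 25 fps video) is an assumption made for illustration.

```python
# FPS values as reported in Tab. 1 for the 3D-aware methods.
TABLE1_FPS = {"AD-NeRF": 0.13, "RAD-NeRF": 32, "ER-NeRF": 34,
              "GaussianTalker*": 121, "GaussianTalker": 98}
VIDEO_FPS = 25                                  # frame rate of the training videos

def frame_time_ms(fps):
    return 1000.0 / fps

budget_ms = frame_time_ms(VIDEO_FPS)            # 40 ms per frame at 25 fps
for name, fps in TABLE1_FPS.items():
    ms = frame_time_ms(fps)
    status = "keeps up with playback" if ms <= budget_ms else "slower than playback"
    print(f"{name:16s} {ms:10.2f} ms/frame  {status}")

# Relative rendering speed of the lightweight variant versus the fastest NeRF baseline.
speedup = TABLE1_FPS["GaussianTalker*"] / TABLE1_FPS["ER-NeRF"]
print(f"speedup over ER-NeRF: {speedup:.2f}x")
```

At 121 FPS a frame takes roughly 8.3 ms, leaving most of the 40 ms playback budget free for other stages of a pipeline.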

5.2.3. Cross-driven evaluation.

Results in Table 2 showcase successful lip movement synthesis with general audio input. GaussianTalker consistently exhibits the highest Sync score among NeRF-based methods, demonstrating its effectiveness in handling unseen audio for lip synchronization. These results highlight GaussianTalker’s ability to generate high-fidelity 3D heads with real-time rendering speeds and accurate lip synchronization even with diverse audio inputs.

5.3. Qualitative Evaluation

In Fig. 5, we showcase results from self-driven and cross-driven experiments. We choose four key frames from each of the two experiment settings to compare the reconstruction quality and lip-sync accuracy. While 2D-based methods (Wav2Lip, PC-AVS) excel in lip synchronization, they fall short of generating a faithful and consistent face when the head is rotated. AD-NeRF suffers from blurry reconstructions due to its lack of eye blink control. RAD-NeRF and ER-NeRF, while demonstrating improved facial consistency, can exhibit discrepancies in lip synchronization and fail to capture hair movement during head rotations.

In contrast, GaussianTalker generates photorealistic images with intricate details in non-rigid regions like eyes and wrinkles. Our spatial-audio attention module effectively disentangles audio-driven motions from scene variations, enabling precise control of mouth movements. This capability allows our model to capture hair movement realistically when the head rotates, leading to superior overall head reconstruction fidelity. In order to comprehensively visualize the efficacy of our proposed method, we provide the rendered videos in the supplementary file. The provided supplementary video demonstrates impressive lip synchronization capabilities and high fidelity head reconstruction with realistic motion.

5.4. Ablation Study

In this section, we provide ablation studies to validate the efficacy of the design choices of our model. We also show detailed visualizations of the generated results in the supplementary material for better comparison.

Table 3. Ablation study results comparing various attribute configurations for embedding canonical 3D Gaussian attributes.

| Method | PSNR ↑ | LPIPS ↓ | FID ↓ | LMD ↓ | Sync ↑ |
|---|---|---|---|---|---|
| Ground Truth | N/A | 0 | 0 | 0 | 8.935 |
| $s, r, SH, \alpha$ | 33.195 | 0.016 | 9.976 | 2.873 | 6.927 |
| $SH, \alpha$ | 33.299 | 0.014 | 9.808 | 2.891 | 6.853 |
| $r, s$ | 33.056 | 0.016 | 11.775 | 2.873 | 6.892 |
| random init. | 33.040 | 0.017 | 11.915 | 2.996 | 6.543 |

Table 4. Ablation study on selection of deformed attributes.

| Method | PSNR ↑ | LPIPS ↓ | FID ↓ | LMD ↓ | Sync ↑ |
|---|---|---|---|---|---|
| Ground Truth | N/A | 0 | 0 | 0 | 8.935 |
| $\Delta SH, \Delta\alpha$ | 32.746 | 0.021 | 44.933 | 3.179 | 6.694 |
| $\Delta\mu, \Delta r, \Delta s$ | 33.036 | 0.013 | 17.52 | 2.970 | 6.688 |
| $\Delta\mu, \Delta r, \Delta s, \Delta SH, \Delta\alpha$ | 33.299 | 0.013 | 9.808 | 2.890 | 6.928 |

Table 5. Ablation study on augmented input conditions.

| Method | PSNR ↑ | LPIPS ↓ | FID ↓ | LMD ↓ | Sync ↑ |
|---|---|---|---|---|---|
| Ground Truth | N/A | 0 | 0 | 0 | 8.935 |
| w/o null-vec | 32.997 | 0.014 | 9.908 | 2.933 | 6.698 |
| w/o eye feature | 32.826 | 0.015 | 10.060 | 2.902 | 6.911 |
| w/o viewpoint | 31.866 | 0.019 | 13.231 | 3.052 | 6.563 |
| All (Ours) | 33.299 | 0.014 | 9.809 | 2.891 | 6.928 |

Table 6. Ablation study on the effectiveness of stage-wise training.

| Method | iter. | PSNR ↑ | LPIPS ↓ | FID ↓ | LMD ↓ | Sync ↑ |
|---|---|---|---|---|---|---|
| Ground Truth | - | N/A | 0 | 0 | 0 | 8.935 |
| w/o stage-wise | 500 | 26.063 | 0.072 | 66.629 | 3.446 | 1.348 |
| w/o stage-wise | 1000 | 26.478 | 0.064 | 56.890 | 3.344 | 4.007 |
| w/o stage-wise | 5000 | 32.676 | 0.016 | 14.026 | 2.971 | 6.602 |
| w/ stage-wise | 500 | 31.076 | 0.029 | 31.301 | 3.792 | 1.548 |
| w/ stage-wise | 1000 | 31.923 | 0.024 | 20.366 | 3.245 | 4.449 |
| w/ stage-wise | 5000 | 32.733 | 0.014 | 11.173 | 2.923 | 6.736 |

5.4.1. Attribute conditions for triplane

Our proposed triplane encodes the facial information of the canonical 3D head learned by the 3D Gaussians. This mechanism also enforces spatial relationships between Gaussians for better deformation. In Tab. 3, we demonstrate the effectiveness of this approach with a quantitative ablation on which attributes are conditioned on the embedding $f(\mu_c)$. We also provide results where all attributes are optimized separately, following the original implementation, and the triplane is trained in the deformation stage. Using only subsets of the Gaussian attributes lowers performance in lip synchronization and precision. Removing the attribute conditions during training discards the spatial information embedded in the triplane, leading to a lack of facial cohesion at inference time.

5.4.2. Selection of deformed attributes.

A major challenge of manipulating the Gaussians is the large number of parameters that need to be controlled. While estimating offsets for only a subset of attributes could reduce the computational load, it may compromise overall fidelity due to the lack of control. To examine this, in Tab. 4, we investigate different selections of Gaussian attributes for deformation. Controlling only $SH$ and $\alpha$ makes the formulation similar to conditional NeRF-based works (Guo et al., 2021; Tang et al., 2022; Li et al., 2023). Because 3DGS is an explicit representation that specifies 3D positions and shapes, controlling only the appearance attributes leads to a loss of overall fidelity. Conversely, controlling only the attributes that define the position and shape of the 3D Gaussians yields lower reconstruction accuracy. Deforming all Gaussian attributes is crucial for the highest fidelity and superior lip synchronization.

5.4.3. Disentanglement of audio-unrelated motion.

We also investigate the significance of using augmented conditions, such as eye blink, facial viewpoint, and null-vector. We evaluate the influence of these additional conditions on image fidelity and lip synchronization by selectively removing them during training (Table 5). The lower reconstruction scores are attributed to the low lip-sync accuracy caused by the entanglement of verbal motion with scene variations unrelated to audio. In the supplementary material, we also visualize the attention scores of each comparison experiment for detailed analysis.

5.4.4. Stagewise optimization

In Tab. 6, we investigate the importance of employing a separate canonical stage. As the comparison baseline, we optimize the whole architecture by training all modules simultaneously from scratch. While the final generated results show similar performance, optimizing the coarse facial geometry before training the deformation network leads to faster convergence of the whole pipeline.

6. Conclusion

In this work, we have proposed GaussianTalker, a novel framework for real-time pose-controllable 3D talking head synthesis, leveraging the 3D Gaussians for the head representation. Our method enables precise control over Gaussian primitives by conditioning features extracted from a multi-resolution triplane. Additionally, the integration of a spatial-audio cross-attention module facilitates the dynamic deformation of facial regions, allowing for nuanced adjustments based on audio cues and enhancing verbal motion disentanglement. Our method is distinguished from prior NeRF-based methods by its superior inference speed and high-fidelity results for out-of-domain audio tracks. The efficacy of our approach is validated by quantitative and qualitative analyses. We look forward to enriched user experiences, particularly in video game development, where real-time rendering capabilities of GaussianTalker promise to enhance interactive digital environments.

References

  • Athar et al. (2022) ShahRukh Athar, Zexiang Xu, Kalyan Sunkavalli, Eli Shechtman, and Zhixin Shu. 2022. Rignerf: Fully controllable neural 3d portraits. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition. 20364–20373.
  • Bulat and Tzimiropoulos (2017) Adrian Bulat and Georgios Tzimiropoulos. 2017. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In Proceedings of the IEEE international conference on computer vision.
  • Cao and Johnson (2023) Ang Cao and Justin Johnson. 2023. HexPlane: A Fast Representation for Dynamic Scenes. CVPR (2023).
  • Cen et al. (2023) Jiazhong Cen, Jiemin Fang, Chen Yang, Lingxi Xie, Xiaopeng Zhang, Wei Shen, and Qi Tian. 2023. Segment any 3d gaussians. arXiv preprint arXiv:2312.00860 (2023).
  • Chan et al. (2022) Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. 2022. Efficient Geometry-Aware 3D Generative Adversarial Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16123–16133.
  • Chen et al. (2019) Lele Chen, Ross K Maddox, Zhiyao Duan, and Chenliang Xu. 2019. Hierarchical Cross-Modal Talking Face Generation With Dynamic Pixel-Wise Loss. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 7832–7841.
  • Chen et al. (2023) Yufan Chen, Lizhen Wang, Qijing Li, Hongjiang Xiao, Shengping Zhang, Hongxun Yao, and Yebin Liu. 2023. Monogaussianavatar: Monocular gaussian point-based head avatar. arXiv preprint arXiv:2312.04558 (2023).
  • Deng et al. (2019) Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde Jia, and Xin Tong. 2019. Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops.
  • Dhamo et al. (2023) Helisa Dhamo, Yinyu Nie, Arthur Moreau, Jifei Song, Richard Shaw, Yiren Zhou, and Eduardo Pérez-Pellitero. 2023. Headgas: Real-time animatable head avatars via 3d gaussian splatting. arXiv preprint arXiv:2312.02902 (2023).
  • Ekman and Friesen (1978) Paul Ekman and Wallace V. Friesen. 1978. Facial Action Coding System: Manual. Palo Alto: Consulting Psychologists Press.
  • Fang et al. (2023) Jiemin Fang, Junjie Wang, Xiaopeng Zhang, Lingxi Xie, and Qi Tian. 2023. Gaussianeditor: Editing 3d gaussians delicately with text instructions. arXiv preprint arXiv:2311.16037 (2023).
  • Fang et al. (2022) Jiemin Fang, Taoran Yi, Xinggang Wang, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Matthias Nießner, and Qi Tian. 2022. Fast Dynamic Radiance Fields with Time-Aware Neural Voxels. In SIGGRAPH Asia 2022 Conference Papers.
  • Fridovich-Keil et al. (2023) Sara Fridovich-Keil, Giacomo Meanti, Frederik Warburg, Benjamin Recht, and Angjoo Kanazawa. 2023. K-Planes: Explicit Radiance Fields in Space, Time, and Appearance. arXiv preprint arXiv:2301.10241 (2023).
  • Gao et al. (2024) Lin Gao, Jie Yang, Bo-Tao Zhang, Jia-Mu Sun, Yu-Jie Yuan, Hongbo Fu, and Yu-Kun Lai. 2024. Mesh-based Gaussian Splatting for Real-time Large-scale Deformation. arXiv preprint arXiv:2402.04796 (2024).
  • Gao et al. (2022) Xuan Gao, Chenglai Zhong, Jun Xiang, Yang Hong, Yudong Guo, and Juyong Zhang. 2022. Reconstructing personalized semantic facial nerf models from monocular video. ACM Transactions on Graphics (TOG) 41, 6 (2022), 1–12.
  • Grassal et al. (2022) Philip-William Grassal, Malte Prinzler, Titus Leistner, Carsten Rother, Matthias Nießner, and Justus Thies. 2022. Neural head avatars from monocular rgb videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18653–18664.
  • Guo et al. (2021) Yudong Guo, Keyu Chen, Sen Liang, Yong-Jin Liu, Hujun Bao, and Juyong Zhang. 2021. AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5784–5794.
  • Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In NeurIPS.
  • Hu and Liu (2024) Shoukang Hu and Ziwei Liu. 2024. GauHuman: Articulated Gaussian Splatting from Monocular Human Videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Huang et al. (2020) Yuge Huang, Yuhan Wang, Ying Tai, Xiaoming Liu, Pengcheng Shen, Shaoxin Li, Jilin Li, and Feiyue Huang. 2020. CurricularFace: Adaptive Curriculum Learning Loss for Deep Face Recognition. In CVPR.
  • Jamaludin et al. (2019) Amir Jamaludin, Joon Son Chung, and Andrew Zisserman. 2019. You Said That?: Synthesising Talking Faces from Audio. International Journal of Computer Vision 127 (2019), 1767–1779.
  • Kerbl et al. (2023) Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 2023. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (ToG) 42, 4 (2023), 1–14.
  • Khakhulin et al. (2022) Taras Khakhulin, Vanessa Sklyarova, Victor Lempitsky, and Egor Zakharov. 2022. Realistic one-shot mesh-based head avatars. In European Conference on Computer Vision. Springer, 345–362.
  • Li et al. (2023) Jiahe Li, Jiawei Zhang, Xiao Bai, Jun Zhou, and Lin Gu. 2023. Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis. arXiv preprint arXiv:2307.09323 (2023).
  • Li et al. (2017) Tianye Li, Timo Bolkart, Michael J. Black, Hao Li, and Javier Romero. 2017. Learning a Model of Facial Shape and Expression from 4D Scans. ACM Trans. Graph. 36, 6, Article 194 (nov 2017), 17 pages.
  • Li et al. (2024) Zhe Li, Zerong Zheng, Lizhen Wang, and Yebin Liu. 2024. Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • Liu et al. (2022) Xian Liu, Yinghao Xu, Qianyi Wu, Hang Zhou, Wayne Wu, and Bolei Zhou. 2022. Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVII. Springer, 106–125.
  • Liu et al. (2023) Yang Liu, Xiang Huang, Minghan Qin, Qinwei Lin, and Haoqian Wang. 2023. Animatable 3D Gaussian: Fast and High-Quality Reconstruction of Multiple Human Avatars. arXiv preprint arXiv:2311.16482 (2023).
  • Lu et al. (2021) Yuanxun Lu, Jinxiang Chai, and Xun Cao. 2021. Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation. ACM Trans. Graph. 40, 6, Article 220 (dec 2021), 17 pages. https://doi.org/10.1145/3478513.3480484
  • Luiten et al. (2023) Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. 2023. Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis. arXiv:2308.09713 [cs.CV]
  • Mildenhall et al. (2020) Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. 2020. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV.
  • Müller et al. (2022) Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. 2022. Instant Neural Graphics Primitives with a Multiresolution Hash Encoding. ACM Transactions on Graphics (ToG) 41, 4 (2022), 1–15.
  • Prajwal et al. (2020) KR Prajwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, and CV Jawahar. 2020. A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild. In Proceedings of the 28th ACM International Conference on Multimedia. 484–492.
  • Qian et al. (2023) Shenhan Qian, Tobias Kirschstein, Liam Schoneveld, Davide Davoli, Simon Giebenhain, and Matthias Nießner. 2023. Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians. arXiv preprint arXiv:2312.02069 (2023).
  • Shen et al. (2022) Shuai Shen, Wanhua Li, Zheng Zhu, Yueqi Duan, Jie Zhou, and Jiwen Lu. 2022. Learning Dynamic Facial Radiance Fields for Few-Shot Talking Head Synthesis. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XII. Springer, 666–682.
  • Song et al. (2022) Linsen Song, Wayne Wu, Chen Qian, Ran He, and Chen Change Loy. 2022. Everybody’s talkin’: Let me talk as you want. IEEE Transactions on Information Forensics and Security 17 (2022), 585–598.
  • Sun et al. (2021) Yasheng Sun, Hang Zhou, Ziwei Liu, and Hideki Koike. 2021. Speech2Talking-Face: Inferring and Driving a Face with Synchronized Audio-Visual Representation. In IJCAI, Vol. 2. 4.
  • Suwajanakorn et al. (2017) Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. 2017. Synthesizing obama: learning lip sync from audio. ACM Transactions on Graphics (ToG) 36, 4 (2017), 1–13.
  • Tang et al. (2022) Jiaxiang Tang, Kaisiyuan Wang, Hang Zhou, Xiaokang Chen, Dongliang He, Tianshu Hu, Jingtuo Liu, Gang Zeng, and Jingdong Wang. 2022. Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition. arXiv preprint arXiv:2211.12368 (2022).
  • Thies et al. (2020) Justus Thies, Mohamed Elgharib, Ayush Tewari, Christian Theobalt, and Matthias Nießner. 2020. Neural Voice Puppetry: Audio-Driven Facial Reenactment. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16. Springer, 716–731.
  • Wang et al. (2024) Jie Wang, Jiu-Cheng Xie, Xianyan Li, Feng Xu, Chi-Man Pun, and Hao Gao. 2024. GaussianHead: High-fidelity Head Avatars with Learnable Gaussian Derivation. arXiv:2312.01632 [cs.CV]
  • Wang et al. (2020) Kaisiyuan Wang, Qianyi Wu, Linsen Song, Zhuoqian Yang, Wayne Wu, Chen Qian, Ran He, Yu Qiao, and Chen Change Loy. 2020. MEAD: A Large-Scale Audio-Visual Dataset for Emotional Talking-Face Generation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI. Springer, 700–717.
  • Wiles et al. (2018) Olivia Wiles, A Sophia Koepke, and Andrew Zisserman. 2018. X2Face: A Network for Controlling Face Generation Using Images, Audio, and Pose Codes. In Computer Vision–ECCV 2018: 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIII 15. Springer, 690–706.
  • Wu et al. (2023) Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 2023. 4D Gaussian Splatting for Real-Time Dynamic Scene Rendering. arXiv:2310.08528 [cs.CV]
  • Yang et al. (2023a) Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. 2023a. Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction. arXiv:2309.13101 [cs.CV]
  • Yang et al. (2023b) Zeyu Yang, Hongye Yang, Zijie Pan, Xiatian Zhu, and Li Zhang. 2023b. Real-time Photorealistic Dynamic Scene Representation and Rendering with 4D Gaussian Splatting. arXiv:2310.10642 [cs.CV]
  • Yao et al. (2022) Shunyu Yao, RuiZhe Zhong, Yichao Yan, Guangtao Zhai, and Xiaokang Yang. 2022. DFA-NeRF: Personalized Talking Head Generation via Disentangled Face Attributes Neural Rendering. arXiv preprint arXiv:2201.00791 (2022).
  • Ye et al. (2023) Zhenhui Ye, Jinzheng He, Ziyue Jiang, Rongjie Huang, Jiawei Huang, Jinglin Liu, Yi Ren, Xiang Yin, Zejun Ma, and Zhou Zhao. 2023. GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation. arXiv preprint arXiv:2305.00787 (2023).
  • Ye et al. (2022) Zhenhui Ye, Ziyue Jiang, Yi Ren, Jinglin Liu, Jinzheng He, and Zhou Zhao. 2022. GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis. In The Eleventh International Conference on Learning Representations.
  • Yifan et al. (2019) Wang Yifan, Felice Serena, Shihao Wu, Cengiz Öztireli, and Olga Sorkine-Hornung. 2019. Differentiable surface splatting for point-based geometry processing. ACM Transactions on Graphics (TOG) 38, 6 (2019), 1–14.
  • Yin et al. (2022) Fei Yin, Yong Zhang, Xiaodong Cun, Mingdeng Cao, Yanbo Fan, Xuan Wang, Qingyan Bai, Baoyuan Wu, Jue Wang, and Yujiu Yang. 2022. Styleheat: One-shot high-resolution editable talking face generation via pre-trained stylegan. In European conference on computer vision. Springer, 85–101.
  • Yu et al. (2020) Lingyun Yu, Jun Yu, Mengyan Li, and Qiang Ling. 2020. Multimodal inputs driven talking face generation with spatial–temporal dependency. IEEE Transactions on Circuits and Systems for Video Technology 31, 1 (2020), 203–216.
  • Zhang et al. (2018) Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the IEEE conference on computer vision and pattern recognition. 586–595.
  • Zhang et al. (2023) Wenxuan Zhang, Xiaodong Cun, Xuan Wang, Yong Zhang, Xi Shen, Yu Guo, Ying Shan, and Fei Wang. 2023. SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8652–8661.
  • Zhang et al. (2021) Zhimeng Zhang, Lincheng Li, Yu Ding, and Changjie Fan. 2021. Flow-Guided One-Shot Talking Face Generation With a High-Resolution Audio-Visual Dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Zheng et al. (2022) Yufeng Zheng, Victoria Fernández Abrevaya, Marcel C Bühler, Xu Chen, Michael J Black, and Otmar Hilliges. 2022. Im avatar: Implicit morphable head avatars from videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13545–13555.
  • Zhou et al. (2021) Hang Zhou, Yasheng Sun, Wayne Wu, Chen Change Loy, Xiaogang Wang, and Ziwei Liu. 2021. Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 4176–4186.
  • Zhou et al. (2020) Yang Zhou, Xintong Han, Eli Shechtman, Jose Echevarria, Evangelos Kalogerakis, and Dingzeyu Li. 2020. MakeItTalk: speaker-aware talking-head animation. ACM Transactions On Graphics (TOG) 39, 6 (2020), 1–15.
  • Zwicker et al. (2001) Matthias Zwicker, Hanspeter Pfister, Jeroen Van Baar, and Markus Gross. 2001. Surface splatting. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques. 371–378.

Appendix

In the following, we describe the implementation details and further analyses of GaussianTalker. We first introduce our network design and hyperparameter settings in Sec. A. Sec. B details the analyses of the proposed method that were conducted in the main paper. In Sec. C, we validate our method with additional qualitative results and a user study. Further ablation studies are presented in Sec. D. To further demonstrate the robustness and effectiveness of our framework, we also provide a supplementary video (Sec. E). Finally, we discuss the limitations and ethical considerations of our research in Sec. F.

Appendix A Implementation Details

A.1. Network architecture

A.1.1. Multi-resolution Triplane.

Our multi-resolution triplane consists of three orthogonal grids with a hidden feature dimension of $H=64$ and a base resolution of $R=64$, which is further upsampled by a factor of 2.

A.1.2. Canonical 3D Gaussian attribute predictor.

The network that predicts the attributes of the canonical 3D Gaussians is composed of MLPs: $\mathcal{F}_{\mathrm{can}}=\{\phi_{\mathrm{shared}},\phi_{r},\phi_{s},\phi_{SH},\phi_{\alpha}\}$. Specifically, a tiny MLP $\phi_{\mathrm{shared}}$ encodes the triplane embedding $f(\mu_{c})$ and outputs a shared feature $\kappa$ for all attributes. The subsequent MLP regressors map this feature to each 3D Gaussian attribute:

$$\begin{gathered}\kappa=\phi_{\mathrm{shared}}(f(\mu_{c})),\\ r_{c}=\phi_{r}(\kappa),\;s_{c}=\phi_{s}(\kappa),\;SH_{c}=\phi_{SH}(\kappa),\;\alpha_{c}=\phi_{\alpha}(\kappa).\end{gathered}\tag{17}$$
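The structure of Eq. 17 can be illustrated with a minimal NumPy sketch. The layer widths, the random weights, and the output sizes (a 4-D quaternion for rotation, 3-D scale, 48 SH coefficients, scalar opacity) are assumptions borrowed from common 3DGS conventions rather than values stated above, and `init_mlp`, `run_mlp`, and `predict_canonical` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Create random weights for a small MLP; `sizes` lists the layer widths."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def run_mlp(params, x):
    """Apply linear layers with ReLU in between (no activation after the last)."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

FEAT_DIM = 64   # assumed width of the triplane embedding f(mu_c)
HIDDEN = 64     # assumed width of the shared feature kappa
# Assumed output sizes: quaternion (4), scale (3), degree-3 SH (16 * 3), opacity (1).
HEAD_DIMS = {"r": 4, "s": 3, "SH": 48, "alpha": 1}

phi_shared = init_mlp([FEAT_DIM, HIDDEN, HIDDEN])
phi_heads = {name: init_mlp([HIDDEN, HIDDEN, dim]) for name, dim in HEAD_DIMS.items()}

def predict_canonical(f_mu):
    """Eq. 17: kappa = phi_shared(f(mu_c)); each head maps kappa to one attribute."""
    kappa = run_mlp(phi_shared, f_mu)
    return {name: run_mlp(head, kappa) for name, head in phi_heads.items()}

f_mu = rng.normal(size=(1000, FEAT_DIM))   # stand-in embeddings for 1000 Gaussians
canonical = predict_canonical(f_mu)
```

Sharing $\kappa$ across the heads means every attribute is regressed from the same spatial feature, so only the last, small regressors are attribute-specific.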

A.1.3. Deformation offset predictor.

Similar to $\mathcal{F}_{\mathrm{can}}$, the deformation prediction network $\mathcal{F}_{\mathrm{deform}}=\{\psi_{\mu},\psi_{r},\psi_{s},\psi_{SH},\psi_{\alpha}\}$, which estimates the deformation offsets of each Gaussian attribute for each frame, consists of several small MLP regressors. For the $n$-th frame, the final output embedding from the cross-attention module, $z^{L}_{n}$, is mapped to each attribute offset such that

$$\begin{gathered}\Delta\mu_{n}=\psi_{\mu}(z^{L}_{n}),\;\Delta r_{n}=\psi_{r}(z^{L}_{n}),\;\Delta s_{n}=\psi_{s}(z^{L}_{n}),\\ \Delta SH_{n}=\psi_{SH}(z^{L}_{n}),\;\Delta\alpha_{n}=\psi_{\alpha}(z^{L}_{n}).\end{gathered}\tag{18}$$
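As a rough illustration of Eq. 18, the sketch below maps a per-Gaussian embedding to one offset per attribute and then applies the offsets. It makes several assumptions: each $\psi$ is reduced to a single linear layer, the embedding width and attribute sizes are placeholders, and the offsets are added to the canonical attributes. The names `predict_offsets` and `deform` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 64  # assumed width of z_n^L
DIMS = {"mu": 3, "r": 4, "s": 3, "SH": 48, "alpha": 1}  # assumed attribute sizes

# One small regressor per attribute; a single linear layer stands in for each psi.
psi = {k: (rng.normal(0.0, 0.01, (EMBED_DIM, d)), np.zeros(d)) for k, d in DIMS.items()}

def predict_offsets(z):
    """Eq. 18: map the attention output z_n^L to one offset per attribute."""
    return {k: z @ w + b for k, (w, b) in psi.items()}

def deform(canonical, offsets):
    """Assumed additive update: per-frame attribute = canonical + offset."""
    return {k: canonical[k] + offsets[k] for k in DIMS}

z = rng.normal(size=(5, EMBED_DIM))                      # embeddings for 5 Gaussians
canonical = {k: np.zeros((5, d)) for k, d in DIMS.items()}
offsets = predict_offsets(z)
frame_attrs = deform(canonical, offsets)
```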

A.2. Hyperparameter Configuration

During the canonical stage, we train for 8,000 iterations for a specific identity. We set the weights for the loss functions as follows: $\lambda_{1}=0.8$, $\lambda_{\mathrm{lpips}}=0.01$, and $\lambda_{\mathrm{D-SSIM}}=0.2$. The initial learning rate for the multi-resolution triplane is set to 0.0016, gradually decaying to 0.00016. Similarly, the learning rate for $\mathcal{F}_{\mathrm{can}}$ starts at 0.0001 and decays to 0.00001. We cap the maximum number of 3D Gaussians at 50,000, and we do not use the opacity reset operation from the original implementation (Kerbl et al., 2023), as we found it does not yield discernible benefits in our experiments.

Subsequently, in the deformation stage, we train the network for 8,000 iterations. We keep the same weights for the loss functions, $\lambda_{1}=0.8$, $\lambda_{\mathrm{lpips}}=0.01$, and $\lambda_{\mathrm{D-SSIM}}=0.2$, and additionally set $\lambda_{\mathrm{lip}}=0.8$. All modules are trained with an initial learning rate of 0.0001, gradually decreasing to 0.00001.

While our spatial-audio cross-attention module primarily employs $L=2$ cross-attention layers, our modified GaussianTalker with $L=1$ can achieve comparable results with even faster inference speeds.
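The settings of this subsection can be collected in a short sketch. The loss weights and learning-rate endpoints are taken from the text. The shape of the decay is not specified there, so the log-linear (exponential) interpolation below is an assumption, and `total_loss` and `decayed_lr` are hypothetical helper names.

```python
import math

# Loss weights from the text; the deformation stage adds a lip term.
CANONICAL_WEIGHTS = {"l1": 0.8, "lpips": 0.01, "dssim": 0.2}
DEFORM_WEIGHTS = {**CANONICAL_WEIGHTS, "lip": 0.8}

def total_loss(terms, weights):
    """Weighted sum of the individual loss terms."""
    return sum(weights[name] * terms[name] for name in weights)

def decayed_lr(step, max_steps, lr_init, lr_final):
    """Log-linear decay from lr_init to lr_final (assumed schedule shape)."""
    t = min(max(step / max_steps, 0.0), 1.0)
    return math.exp((1.0 - t) * math.log(lr_init) + t * math.log(lr_final))

# Triplane learning rate at the start, middle, and end of the canonical stage.
triplane_lrs = [decayed_lr(s, 8000, 0.0016, 0.00016) for s in (0, 4000, 8000)]
```

With this schedule the learning rate at the midpoint is the geometric mean of the two endpoints, so it falls by the same factor in each half of training.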

Appendix B Additional Analysis

B.1. Splatting on the background image

Initially, we followed the original implementation (Kerbl et al., 2023) and rendered faces on a white background. However, this approach has limitations. Rendering face-only images on a white background requires ground-truth images prepared in the same way, which in turn requires a segmentation model. Because the segmentation model is inherently inaccurate, the resulting facial masks tended to cover larger areas than the face and included parts of the background. In addition, loss terms such as SSIM and the perceptual loss placed disproportionate emphasis on the imperfect facial contours relative to mouth and eye movements, which hindered learning.

As a solution, we render faces against ground-truth (GT) backgrounds instead. Distributing the loss across the entire image allows the boundaries of Gaussian presence to be learned accurately. Similar to the preprocessing in previous NeRF-based works (Guo et al., 2021; Tang et al., 2022; Li et al., 2023), we remove the person from the background image by interpolation. Faces are then rendered directly onto this background with Gaussian splatting, enabling comparison with the GT video frames. With this strategy, GaussianTalker is trained without the need for a facial mask, which facilitates the faithful representation of intricate details such as hair.
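One way to realize rendering against a background image is standard alpha compositing of the splatted result over the person-removed background. The sketch below is an assumption about how this step could be done rather than a description of our exact implementation, and `composite_on_background` is a hypothetical name.

```python
import numpy as np

def composite_on_background(fg_rgb, fg_alpha, background):
    """Blend a rendered head over a background image.

    fg_rgb:     (H, W, 3) colour accumulated from the splatted Gaussians
                (already weighted by opacity).
    fg_alpha:   (H, W) accumulated opacity in [0, 1].
    background: (H, W, 3) person-removed background image.
    """
    return fg_rgb + (1.0 - fg_alpha[..., None]) * background

background = np.full((2, 2, 3), 0.3)
empty = composite_on_background(np.zeros((2, 2, 3)), np.zeros((2, 2)), background)
opaque = composite_on_background(np.full((2, 2, 3), 0.7), np.ones((2, 2)), background)
```

Pixels not covered by any Gaussian reproduce the background exactly, so the image loss there penalizes only stray Gaussians and no segmentation mask is needed.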

B.2. Visualization of Attention

In our spatial-audio cross-attention module, the computation of the attention score is formalized by the following equation:

$$(\mathrm{A}_{n})^{l}=\frac{\mathrm{softmax}(qk^{\intercal}_{n})^{l}}{\sqrt{d_{k}}},\tag{19}$$

where $l$ denotes the index of $\{a_{n},e_{n},\upsilon_{n},\emptyset\}$ and $(\mathrm{A}_{n})^{l}$ corresponds to its calculated attention score. $\mathrm{A}_{n}$ denotes the concatenation of all $(\mathrm{A}_{n})^{l}$, resulting in a shape of $B\times H\times N\times d_{k}$, whose dimensions respectively indicate batch size, number of heads, number of Gaussians, and number of features per Gaussian.
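A single-head, single-batch sketch of per-condition attention scores is given below. It applies the $1/\sqrt{d_{k}}$ scaling inside the softmax, as in standard scaled dot-product attention, which is an assumption about how the score is computed in practice. The four keys stand for the condition tokens $\{a_{n},e_{n},\upsilon_{n},\emptyset\}$, and `condition_attention` is a hypothetical name.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def condition_attention(q, k):
    """Attention of each Gaussian over the condition tokens.

    q: (N, d_k) per-Gaussian queries.
    k: (4, d_k) keys of the four condition tokens for frame n.
    Returns (N, 4): one score per condition token for every Gaussian.
    """
    d_k = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d_k), axis=-1)

rng = np.random.default_rng(0)
scores = condition_attention(rng.normal(size=(6, 8)), rng.normal(size=(4, 8)))
uniform = condition_attention(np.ones((6, 8)), np.zeros((4, 8)))
```

Column $l$ of the result gives, for every Gaussian, how strongly it attends to condition $l$, which is the quantity visualized per condition.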

For each attention score $(\mathrm{A}_{n})^{l}$, we visualize the attention by assigning the score to RGB values. We thereby obtain attention visualization colors $c_{att}$ for each Gaussian. The overall visualization of attention is then calculated as:

$$C=\sum_{i=1}c_{i}\alpha^{\prime}_{i}\prod_{j=1}^{i-1}(1-\alpha^{\prime}_{j}),\tag{20}$$

where $c_{i}$ represents the color associated with each Gaussian, determined by $c_{att}$ along the view direction. $\alpha_{i}^{\prime}$ is the product of the opacity $\alpha$ of the 3D Gaussian and the value of its projected 2D Gaussian, with covariance $\Sigma^{\prime}$, at the pixel. This formulation allows us to visually interpret the model’s focus within the generated representations, effectively highlighting the areas of greatest feature impact.
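Eq. 20 is the usual front-to-back blending used in 3DGS, applied to attention colors instead of SH colors. A minimal per-pixel sketch follows, assuming the Gaussians covering the pixel are already depth-sorted and their effective opacities $\alpha^{\prime}_{i}$ are given. `composite` is a hypothetical name.

```python
import numpy as np

def composite(colors, alphas):
    """Front-to-back blending of depth-sorted Gaussians covering one pixel (Eq. 20).

    colors: (K, 3) per-Gaussian colours c_i, nearest first.
    alphas: (K,) effective opacities alpha'_i in [0, 1].
    """
    # Transmittance before Gaussian i: prod_{j<i} (1 - alpha'_j).
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas[:-1])))
    weights = alphas * transmittance
    return weights @ colors

# A half-transparent red Gaussian in front of a half-transparent green one.
pixel = composite(np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]), np.array([0.5, 0.5]))
```

In this example the front Gaussian contributes with weight 0.5 and the second with weight 0.5 × (1 − 0.5) = 0.25, so the pixel is (0.5, 0.25, 0).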

B.3. Visualization of triplane

Fig. 3 of the main paper visualizes the result of PCA on our multi-resolution triplane, showing the efficacy of using the triplane to embed Gaussian features. We perform PCA on each plane of shape $H\times R\times R$, linearly projecting the first dimension onto three principal components to obtain a shape of $3\times R\times R$. The values are then normalized to the range $[0,255]$ to denote RGB values. As a result, in all of the xy, yz, and zx planes, semantically close facial regions are consistently represented with similar colors.
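The PCA visualization can be sketched in a few lines of NumPy. The sketch computes the principal components with an SVD of the centered features and normalizes each output channel independently; both choices are assumptions, as is the hypothetical name `plane_to_rgb`.

```python
import numpy as np

def plane_to_rgb(plane):
    """Project an (H, R, R) feature plane onto 3 principal components as uint8 RGB."""
    h, r1, r2 = plane.shape
    x = plane.reshape(h, -1).T                 # (R*R, H): one feature vector per cell
    x = x - x.mean(axis=0, keepdims=True)      # centre the features before PCA
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    proj = x @ vt[:3].T                        # (R*R, 3) principal-component scores
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    rgb = (proj - lo) / np.maximum(hi - lo, 1e-12) * 255.0   # per-channel [0, 255]
    return rgb.T.reshape(3, r1, r2).astype(np.uint8)

rng = np.random.default_rng(0)
image = plane_to_rgb(rng.normal(size=(8, 16, 16)))   # small stand-in for one plane
```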

Appendix C Additional Experiments

C.1. Additional qualitative experiments.

We present additional visualizations of generated keyframes from the comparison experiments in the self-driven and cross-driven settings in Fig. 6 and Fig. 7, respectively. These experiments showcase the stability of our method and its applicability to various identities.

C.2. User study

Following previous works (Tang et al., 2022; Li et al., 2023), we conducted a user study to better judge the visual quality of the generated talking head videos. We recruited 21 participants aged 20 to 40 to evaluate the rendered results in the head reconstruction setting. For accurate judgments, we combined all generated videos into a single high-resolution video, enabling participants to observe all movements simultaneously. To ensure a fair comparison, we assigned a number to each generated result instead of identifying it by its method. Participants were asked to rate three aspects of the generated portraits: (1) Lip-sync Accuracy; (2) Video Realness; and (3) Image Quality. The results are shown in Tab. 7.

[Image grid. Rows: GT, Wav2Lip, PC-AVS, AD-NeRF, RAD-NeRF, ER-NeRF, Ours. Column labels: of, be, require, long, up, quality.]
Figure 6. Additional comparison results in the self-driven setting.

[Image grid. Rows: GT, Wav2Lip, PC-AVS, AD-NeRF, RAD-NeRF, ER-NeRF, Ours. Column labels: family, skill, result, help, my, result.]
Figure 7. Additional comparison results in the cross-driven setting.

Table 7. User study results. Ratings are on a scale of 1-5; higher is better. The top, second-best, and third-best results are shown in red, orange, and yellow, respectively.
| Methods | Self-driven: Lip-sync Accuracy | Self-driven: Image Quality | Self-driven: Video Realness | Cross-driven: Lip-sync Accuracy | Cross-driven: Image Quality | Cross-driven: Video Realness |
|---|---|---|---|---|---|---|
| Wav2Lip (Prajwal et al., 2020) | 3.167 | 2.665 | 2.459 | 2.678 | 2.313 | 2.135 |
| PC-AVS (Zhou et al., 2021) | 2.625 | 1.896 | 1.921 | 1.958 | 1.292 | 1.229 |
| AD-NeRF (Guo et al., 2021) | 2.031 | 2.492 | 2.396 | 2.574 | 3.042 | 2.365 |
| RAD-NeRF (Tang et al., 2022) | 2.417 | 2.750 | 2.541 | 2.938 | 3.146 | 2.604 |
| ER-NeRF (Li et al., 2023) | 2.354 | 3.042 | 2.771 | 2.792 | 3.458 | 3.146 |
| GaussianTalker | 3.083 | 3.667 | 3.188 | 3.250 | 3.729 | 3.208 |

Appendix D Ablation studies

D.1. Initialization of $\mu_{c}$

Our study explores the impact of initialization on canonical 3D Gaussian optimization. In the default setting, we leverage a pre-optimized Basel Face Model (Deng et al., 2019) to obtain camera parameters during preprocessing. The optimized mesh vertices of this model are used to initialize the 3D positions $\mu_{c}$ of the 3D Gaussians.

To investigate the impact of the proposed 3DMM-based initialization, we conduct an ablation study by comparing it to random initialization from a sphere. In Fig. 8, we visually analyze the optimization process of the canonical stage under both initialization settings. Our experiments demonstrate that utilizing 3DMM-based initialization leads to faster convergence, attributed to the facial depth information encoded in the initialized points.
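The two initialization settings compared here can be sketched as follows. The random baseline is written as uniform sampling inside a sphere, which is one possible reading of "random initialization from a sphere". The radius, the seed, and the names `init_from_sphere` and `init_from_mesh` are assumptions for illustration.

```python
import numpy as np

def init_from_sphere(num_points, radius=1.0, seed=0):
    """Sample points uniformly inside a sphere (the random-initialization baseline)."""
    rng = np.random.default_rng(seed)
    directions = rng.normal(size=(num_points, 3))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = radius * rng.random(num_points) ** (1.0 / 3.0)  # uniform in volume
    return directions * radii[:, None]

def init_from_mesh(vertices):
    """Use fitted 3DMM mesh vertices directly as the initial Gaussian centres mu_c."""
    return np.asarray(vertices, dtype=np.float64).copy()

sphere_points = init_from_sphere(1000, radius=2.0)
mesh_points = init_from_mesh(np.zeros((10, 3)))   # stand-in for real mesh vertices
```

Mesh vertices already lie near the facial surface, whereas the sphere samples carry no depth information and must be moved there during optimization.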

D.2. Selection of attributes inferred for triplane embeddings.

In Fig. 9, we support the quantitative comparison in the main paper by presenting key frames of the rendered results. Conditioning the triplane embeddings on structure information such as $r$ and $s$ tends to yield less accurate facial details, such as wrinkles in facial muscles. In contrast, while conditioning on the appearance information $SH$ and $\alpha$ produces accurate reconstructions of the canonical head, the facial motion appears less dynamic than the ground truth and does not correlate well with the input speech audio.

D.3. Selection of deformed attributes.

We also provide qualitative comparisons from our ablation study on the selection of Gaussian attributes to be deformed. Using the same comparison settings as in Sec. 5.4.2, we visualize the rendered results in Fig. 10. Deforming only $SH$ and $\alpha$ yields blurry results with unrealistic deformations, while only manipulating

D.4. Disentanglement of audio-unrelated motion

Finally, we reinforce the insights drawn from the quantitative analysis in Section 5.4.3. We elucidate the disentanglement of speech-related motion in Fig. 11 by presenting visualizations of the attention scores for the input conditions across the ablation experiment settings. Notably, the attention scores of the input speech audio become more widely distributed across other facial regions, indicating inadequate disentanglement of speech-related motion when solely provided with speech as the input condition.

Appendix E Supplementary Video

To comprehensively visualize the efficacy of our proposed method in the domain of talking facial video synthesis, we prepared a supplementary video. This video encompasses the results and analysis of our experiments presented in the main paper and the supplementary document. We showcase talking head videos generated under both the self-driven and cross-driven settings and compare them with previous NeRF-based works (Guo et al., 2021; Tang et al., 2022; Li et al., 2023). We also demonstrate the effectiveness of our spatial-audio cross attention module by showing how the attention scores of each condition evolve as the scene progresses. Lastly, the video includes a set of ablation studies that systematically examine the impact of each component of our proposed method.

Appendix F Further Discussions

F.1. Ethical Considerations

Our goal with GaussianTalker is to create realistic talking 3D heads for practical real-world applications like digital assistants and video production. However, its photorealism raises ethical concerns, as it’s difficult to distinguish real from synthetic videos. This can be used to create deepfakes, which are manipulated videos that can be used to spread misinformation or damage someone’s reputation. To address this, we propose several measures: 1) informing users about video authenticity, 2) sharing our results with deepfake detection communities to improve detection algorithms, and 3) advocating for digital watermarks in real videos to deter misuse. Finally, we believe responsible use requires clear regulations to govern deepfakes on social media, protecting users from potential manipulation.

F.2. Limitations and future work

GaussianTalker shares a common limitation with previous NeRF-based talking head synthesis methods: per-identity training. This restricts the model’s ability to generalize to new identities, making data preparation for audio and eye features time-consuming. Additionally, free-viewpoint rendering remains a challenge due to the lack of multi-view training data. While the deformation stage achieves high fidelity and generalizes well to out-of-domain audio, it struggles with extreme viewpoints. Our current approach uses limited canonical training for coarse structure, leading to inconsistencies when synthesizing from very different angles.

Future work will focus on overcoming these limitations. We aim to explore techniques for multi-identity training and efficient data pre-processing. Additionally, we will investigate methods for free-viewpoint rendering using techniques like multi-view data acquisition or neural rendering approaches.

[Figure 8 image grid. Rows: random init.; 3DMM vertices init. Columns (# iter.): initialization, 100, 200, 300, 500, 1000.]
Figure 8. Ablation study on initialization of the canonical position $\mu_c$. We evaluate the effectiveness of the 3DMM-based initialization by visualizing the optimization process of the reconstructed canonical 3D head, and compare it to random initialization. Our experiments demonstrate that utilizing 3DMM-based initialization leverages the depth information of the human face, leading to significantly faster convergence. In contrast, optimizing from randomly sampled points prolongs training duration and fails to completely resolve artifacts, particularly around the eyes and hair regions.

[Figure 9 image grid. Rows: Ground Truth; only $r, s$; only $SH, \alpha$; All (Ours). Columns (key frames): "paper", "up", "quality".]
Figure 9. Ablation study on the selection of attributes inferred from the triplane embedding $f(\mu)$. We compare the generated results from

[Figure 10 image grid. Rows: Ground Truth; $\Delta SH, \Delta\alpha$; $\Delta\mu, \Delta r, \Delta s$; All (Ours). Columns (key frames): "ever", "wherever", "work", "opportunity".]
Figure 10. Deforming only spherical harmonics and opacity results in a significant loss of facial detail and blurry reconstructions. Notably, this leads to unrealistic deformations in the lip region, where the lips and teeth appear merged. Conversely, deforming only the structural information ($\mu, r, s$) produces much less dynamic lip movements. In addition, the generated results show the inside of the mouth, such as the teeth and tongue, less frequently.

[Figure 11 image grid. Rows: w/o eye feature $e_n$; w/o viewpoint $\upsilon_n$; w/o null-vector $\emptyset$; All (Ours). Columns: rendered, audio, eye, viewpoint, null-vec.]
Figure 11. Ablation study on the disentanglement effect of each input condition. We assess the effectiveness of each input condition by turning it on and off in turn, and visualizing the attention scores of each condition.