GaussianTalker: Real-Time High-Fidelity Talking Head Synthesis with Audio-Driven 3D Gaussian Splatting

Kyusun Cho¹, Joungbin Lee¹, Heeji Yoon¹, Yeobin Hong¹, Jaehoon Ko¹,
Sangjun Ahn², and Seungryong Kim¹
¹ Korea University ² NCSOFT
https://ku-cvlab.github.io/GaussianTalker/

Abstract.

We propose GaussianTalker, a novel framework for real-time generation of pose-controllable talking heads. It leverages the fast rendering capabilities of 3D Gaussian Splatting (3DGS) while addressing the challenges of directly controlling 3DGS with speech audio. GaussianTalker constructs a canonical 3DGS representation of the head and deforms it in sync with the audio. A key insight is to encode the 3D Gaussian attributes into a shared implicit feature representation, where they are merged with audio features to manipulate each Gaussian attribute. This design exploits spatially aware features and enforces interactions between neighboring points. The feature embeddings are then fed to a spatial-audio attention module, which predicts frame-wise offsets for the attributes of each Gaussian. This module is more stable than previous concatenation or multiplication approaches for manipulating the numerous Gaussians and their intricate parameters. Experimental results showcase GaussianTalker’s superiority in facial fidelity, lip synchronization accuracy, and rendering speed compared to previous methods. Specifically, GaussianTalker achieves a rendering speed of up to 120 FPS, surpassing previous methods.

Keywords: Talking Head Generation, 3D Controllable Head, 3D Gaussian Splatting

CCS Concepts: Computing methodologies → Reconstruction; Information systems → Multimedia content creation; Computing methodologies → 3D imaging

Figure 1. Fidelity and inference time comparison between existing 3D talking face synthesis models (Guo et al., 2021; Tang et al., 2022; Li et al., 2023) and ours. Our method, GaussianTalker, achieves results on par with or better than existing models at a much higher FPS. Note that we also include GaussianTalker^∗, a more efficient and faster variant. The size of each bubble represents the inference time of each method.

1. Introduction

Generating a talking head video driven by arbitrary speech audio is a popular task that has various uses, including the generation of digital humans, virtual avatars, movie production, and teleconferencing (Wiles et al., 2018; Chen et al., 2019; Jamaludin et al., 2019; Prajwal et al., 2020; Zhang et al., 2023; Suwajanakorn et al., 2017; Thies et al., 2020; Song et al., 2022). While various works (Wiles et al., 2018; Chen et al., 2019; Jamaludin et al., 2019; Prajwal et al., 2020) have successfully attempted to solve this task using generative models, they do not focus on controlling head poses, limiting their realism and applicability. Recently, numerous studies (Guo et al., 2021; Liu et al., 2022; Ye et al., 2022, 2023; Tang et al., 2022; Li et al., 2023) have applied neural radiance fields (NeRF) (Mildenhall et al., 2020) to the creation of pose-controllable talking portraits. By directly conditioning the multi-layer perceptron (MLP) of NeRF on audio features, these methods can synthesize a view-consistent 3D head structure with its lips synced to the input audio. Although these NeRF-based techniques achieve high-quality and consistent visual outputs, their slow inference speed limits their practicality. Although recent advancements (Tang et al., 2022; Li et al., 2023) achieve rendering speeds of up to 30 frames per second (fps) at $512\times 512$ resolution, computational bottlenecks must still be overcome before these methods can be applied in real-world scenarios.

To address this limitation, an intuitive solution is to leverage the fast rendering capabilities of 3D Gaussian Splatting (3DGS) (Kerbl et al., 2023). Recently recognized as a viable alternative to NeRF, 3DGS offers comparable rendering quality while significantly improving inference speeds. Although 3DGS was initially proposed for reconstructing static 3D scenes, subsequent works have extended it to dynamic scenes (Wu et al., 2023; Yang et al., 2023b; Luiten et al., 2023; Yang et al., 2023a). However, there has been little research on leveraging 3DGS to create dynamic 3D scenes with controllable inputs, and most of it has focused on using an intermediate mesh representation to drive the 3D Gaussians (Chen et al., 2023; Qian et al., 2023; Hu and Liu, 2024; Liu et al., 2023; Li et al., 2024). Relying on an intermediate 3D mesh representation, such as FLAME (Li et al., 2017), for deformation often loses fine details in hair and facial wrinkles.

We identify two major challenges in directly mapping the speech audio to the deformation of 3D Gaussians. First, the 3DGS representation lacks shared spatial information among adjacent points, complicating its manipulation. The optimization process of 3DGS does not consider relationships between neighboring Gaussians, which are crucial for maintaining facial region cohesion during deformation. Second, the extensive parameter space and the substantial number of Gaussians pose a challenge to their manipulation. Unlike controllable NeRF representations where the position and the number of sampling points are fixed, the position, shape, and appearance attributes of numerous Gaussian points need to be deformed per frame, while also preserving the intricate facial details.

In this paper, we present GaussianTalker, a novel framework for real-time pose-controllable talking head synthesis. For the first time, we leverage the 3D Gaussian representation to exploit its fast scene modeling capability for audio-driven dynamic facial animation. We construct a static 3DGS representation of the canonical head shape and deform it in sync with the audio. Specifically, we employ a multi-resolution triplane to extract feature embeddings for each 3D Gaussian position, from which each Gaussian attribute is directly estimated. This design ensures that the triplane learns the spatial and semantic information of the 3D head, while the interpolation mechanism of the 2D feature grids efficiently enforces interactions between neighboring points. The feature embeddings are subsequently fed to the proposed spatial-audio attention module, where they are merged with the audio features to predict the frame-wise offsets for the attributes of each Gaussian. This module models the relevance between audio features and the motion of each Gaussian primitive. Cross-attention offers a more stable way to manipulate the substantial number of Gaussians and their intricate parameter space than concatenation (Guo et al., 2021; Tang et al., 2022) or multiplication (Li et al., 2023) as in previous works. Qualitative and quantitative experiments demonstrate GaussianTalker’s superiority in facial fidelity, lip synchronization accuracy, and rendering speed compared to previous methods. Additionally, we conduct ablation studies to verify the effectiveness of individual design choices within our model.

Our main contributions are summarized as follows:

• For the first time, we present a novel audio-conditioned 3D Gaussian Splatting framework for real-time 3D-aware talking head synthesis.
• We reformulate the 3D Gaussian representation with a feature volume representation in order to enforce spatial consistency among adjacent Gaussians.
• We integrate cross-attention mechanisms between audio and spatial features to improve stability and ensure region-specific deformation across a significant number of Gaussians.

2. Related Work

2.1. Audio-driven talking portrait synthesis

Audio-driven talking portrait synthesis aims to create realistic facial animations with accurate lip movements based on audio input. Early 2D GAN-based methods (Prajwal et al., 2020; Yu et al., 2020; Yin et al., 2022; Zhou et al., 2020; Sun et al., 2021) achieved photorealism but lacked control over head pose due to the absence of 3D geometry. In order to control the head poses, some works (Thies et al., 2020; Wang et al., 2020; Lu et al., 2021; Zhang et al., 2023) utilize model-based methods, where facial landmarks and 3D morphable models reinforce the lip sync model with the ability to adjust the orientation of the head. However, these approaches lead to new problems such as extra errors from the intermediate representations, and inaccuracies in identity preservation and realism.

Recently, Neural Radiance Fields (NeRF) (Mildenhall et al., 2020) have been explored for talking portraits due to their ability to capture complex scenes. AD-NeRF (Guo et al., 2021) pioneered using NeRF’s implicit representation for conditional audio input, but separate networks for head and torso limited its flexibility. Subsequent NeRF-based methods (Yao et al., 2022; Shen et al., 2022; Liu et al., 2022) achieved high quality but suffered from slow rendering speeds. While RAD-NeRF (Tang et al., 2022) and ER-NeRF (Li et al., 2023) improved efficiency and quality with grid-based NeRF (Müller et al., 2022), real-time rendering of pose-controllable 3D talking head remains challenging.

2.2. 3D Gaussian splatting

3DGS (Kerbl et al., 2023) is a pioneering technique in point cloud rendering that utilizes a multitude of ellipsoidal, anisotropic balls to precisely represent a scene. Each point embodies a 3D Gaussian distribution, with its mean, covariance, opacity, and spherical harmonics parameters optimized to accurately capture the scene’s shapes and appearances. This approach effectively resolves common issues in point rendering, such as output gaps. Furthermore, combined with a tile-based rasterization algorithm, it facilitates expedited training and real-time rendering capabilities. Recently, 3DGS has gained widespread application in 3D vision tasks such as object manipulation (Fang et al., 2023; Gao et al., 2024), reconstruction (Kerbl et al., 2023; Fang et al., 2022), and perception (Cen et al., 2023; Luiten et al., 2023) within 3D environments.

Figure 2. Overview of our GaussianTalker framework. GaussianTalker utilizes a multi-resolution triplane to leverage different scales of features depicting a canonical 3D head. These features are fed into a spatial-audio attention module along with the audio feature to predict per-frame deformations, enabling fast and reliable talking head synthesis.

2.3. Facial animation with 3DGS

Previous methods for facial reconstruction and animation primarily relied on 3D Morphable Models (3DMM) (Grassal et al., 2022; Khakhulin et al., 2022) or utilized neural implicit representations (Zheng et al., 2022; Athar et al., 2022; Gao et al., 2022). Recent approaches (Qian et al., 2023; Wang et al., 2024; Chen et al., 2023; Dhamo et al., 2023) have shifted towards adopting the 3DGS representation, aiming to leverage the benefits of rapid training and rendering while still achieving competitive levels of photorealism. GaussianAvatars (Qian et al., 2023) reconstructed head avatars by rigging 3D Gaussians on the FLAME (Li et al., 2017) mesh. MonoGaussianAvatar (Chen et al., 2023) learned explicit head avatars by shifting the mean position of 3D Gaussians from canonical to deformed space using Linear Blend Skinning (LBS) while simultaneously adjusting the other Gaussian parameters through a deformation field. GaussianHead (Wang et al., 2024) adopted a motion deformation field to adapt to facial movements while preserving head geometry, and separately utilized a tri-plane to retain the appearance information of individual 3D Gaussians. However, the aforementioned methods tend to depend on parametric models for facial animation. In contrast to previous works, our audio-driven method not only requires no data beyond the speech sequence for facial reenactment but is also readily applicable to novel audio.

3. Preliminary: 3D Gaussian Splatting

3D Gaussian splatting (3DGS) (Kerbl et al., 2023) employs anisotropic 3D Gaussians as geometric primitives for learning an explicit 3D representation. Each 3D Gaussian is defined by a center mean $\mu\in\mathbb{R}^{3}$ and a covariance matrix $\Sigma\in\mathbb{R}^{3\times 3}$ in the 3D coordinate space as follows:

(1)

g(x)=\exp\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right),

for a 3D coordinate $x\in\mathbb{R}^{3}$ . The covariance matrix $\Sigma$ is further decomposed into $\Sigma=RSS^{T}R^{T}$ with a scaling matrix $S$ and a rotation matrix $R$ , defined by a scaling factor $s\in\mathbb{R}^{3}$ and a learnable quaternion $r\in\mathbb{R}^{4}$ , respectively. Additionally, to encode the appearance information, each 3D Gaussian contains a set of spherical harmonics with degree $k$ such that $SH\in\mathbb{R}^{3(k+1)(k+1)}$ , along with an opacity value $\alpha\in\mathbb{R}$ . In summary, 3DGS represents a 3D scene with a set of 3D Gaussian parameters, defined as:

(2)

\mathcal{G}=\{\mu,r,s,SH,\alpha\}.

Given a novel viewing direction $\pi$ , a 2D image $\hat{I}$ is rendered as:

(3)

\hat{I}=\mathcal{R}(\mathcal{G};\pi),

where $\mathcal{R}(\cdot)$ is the differentiable rasterizer.
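The factorization $\Sigma=RSS^{T}R^{T}$ and the density in Eq. (1) can be illustrated with a minimal, dependency-free sketch. The function names and the real-part-first quaternion ordering $(w,x,y,z)$ are assumptions made for this example and are not taken from the paper.

```python
import math

def quat_to_rotmat(q):
    """Normalize q = (w, x, y, z) (assumed ordering) and return the 3x3 rotation R."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def covariance(q, s):
    """Sigma = R S S^T R^T, computed as M M^T with M = R S."""
    R = quat_to_rotmat(q)
    # S is diagonal, so M = R S scales column j of R by s[j].
    M = [[R[i][j] * s[j] for j in range(3)] for i in range(3)]
    return [[sum(M[i][k] * M[j][k] for k in range(3)) for j in range(3)]
            for i in range(3)]

def gaussian_value(x, mu, q, s):
    """Eq. (1): since Sigma^-1 = R S^-2 R^T, the exponent is -0.5 * |S^-1 R^T (x - mu)|^2."""
    R = quat_to_rotmat(q)
    d = [xi - mi for xi, mi in zip(x, mu)]
    local = [sum(R[i][j] * d[i] for i in range(3)) / s[j] for j in range(3)]
    return math.exp(-0.5 * sum(v * v for v in local))
```

Building $\Sigma$ from a rotation and a per-axis scale keeps it symmetric and positive semi-definite by construction, which is why the two factors are optimized instead of the nine matrix entries.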

More specifically, for $\mathcal{R}(\cdot)$ , 3DGS employs differentiable splatting (Yifan et al., 2019) during novel view rendering. In order to project 3D Gaussians to 2D for rendering, the covariance matrix in the 2D space, $\Sigma^{\prime}\in\mathbb{R}^{2\times 2}$ , is calculated from the viewing transform $W$ and the Jacobian $J$ of the affine approximation of the projective transformation (Zwicker et al., 2001) as:

(4)

\Sigma^{\prime}=JW\Sigma W^{T}J^{T}.

Subsequently, the color of each pixel is computed by blending all Gaussians that overlap the pixel, ordered by depth, as follows:

(5)

C=\sum_{i=1}c_{i}{\alpha}^{\prime}_{i}\prod_{j=1}^{i-1}(1-{\alpha}^{\prime}_{j}),

where $c_{i}$ is the color of the $i$ -th Gaussian, determined from its SH coefficients and the view direction, and ${\alpha}_{i}^{\prime}$ is the product of the opacity $\alpha$ of the 3D Gaussian and the value of its projected 2D Gaussian, with covariance $\Sigma^{\prime}$ , at the pixel.
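The blending in Eq. (5) can be sketched for a single pixel as follows. This is an illustrative sketch, not the tile-based rasterizer: it assumes the Gaussians are already sorted from near to far, and it reads ${\alpha}^{\prime}_{i}$ as the opacity multiplied by the projected 2D Gaussian evaluated at the pixel, as in the original 3DGS formulation. All function names are hypothetical.

```python
import math

def alpha_at_pixel(opacity, mean2d, cov2d, pixel):
    """alpha' = opacity * exp(-0.5 * d^T Sigma'^-1 d), with d = pixel - mean2d."""
    dx = pixel[0] - mean2d[0]
    dy = pixel[1] - mean2d[1]
    (a, b), (c, d) = cov2d
    det = a * d - b * c
    # Closed-form inverse of the 2x2 projected covariance.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    quad = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return opacity * math.exp(-0.5 * quad)

def composite(colors, alphas):
    """Front-to-back blending of Eq. (5); inputs must be sorted near to far."""
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # running product of (1 - alpha'_j) over closer Gaussians
    for color, alpha in zip(colors, alphas):
        weight = alpha * transmittance
        for ch in range(3):
            pixel[ch] += weight * color[ch]
        transmittance *= 1.0 - alpha
    return pixel
```

For example, a half-transparent red Gaussian in front of a half-transparent blue one gives `composite([(1, 0, 0), (0, 0, 1)], [0.5, 0.5])`, where the blue contribution is attenuated by the transmittance left after the red one.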

4. Methodology

4.1. Problem formulation and Overview

In this section, we describe the main components of GaussianTalker, designed for the real-time synthesis of high-fidelity, pose-controllable talking head images driven by audio input. Our model is trained on a talking portrait video $\mathcal{V}=\{I_{n}\}$ consisting of $N$ image frames of a single identity. Our objective is to reconstruct a set of canonical 3D Gaussians that represent the mean shape of the talking head, and to learn a deformation module that deforms the 3D Gaussians according to the corresponding input audio. During inference, for the input audio $a_{n}$ , the deformation module predicts the offsets of each Gaussian attribute, and the deformed Gaussians are rasterized at the viewing point $\pi_{n}$ to output the novel image $\hat{I}_{n}$ .

An overview of our proposed method is depicted in Fig. 2. In Sec. 4.2, we first introduce the multi-resolution tri-plane that encodes the low-dimensional features of the 3D Gaussians to represent the static mean shape of the canonical head. In Sec. 4.3, we introduce the spatial-audio cross-attention module that fuses 3D Gaussian features and audio features to accurately model facial motion driven by input audio. Finally, Sec. 4.4 describes the stage-wise training strategy and the loss functions used.

4.2. Learning canonical 3D Gaussians with triplane representation

In this section, we introduce the details of learning the canonical shape of the talking head with the 3D Gaussian representation. The vanilla implementation of 3DGS (Kerbl et al., 2023) does not inherently capture the spatial relationships between neighboring and distant 3D Gaussians. However, an ideal feature representation for a dynamic 3D head should be similar for nearby facial regions and distinct for distant ones, as nearby facial primitives are likely to move in the same direction.

To realize this, we modify the 3D Gaussian representation by learning a low-dimensional feature representation, which can later be merged with the audio features for per-Gaussian deformation. We formulate the embedding space to encode the attributes of the 3D Gaussians, so that the shape and appearance of each Gaussian are taken into account when predicting its deformation offsets. More specifically, we adopt a hybrid 3D representation that utilizes the explicit 3D representation of 3DGS, while also taking advantage of the encoded spatial information of implicit neural radiance fields (Mildenhall et al., 2020). For each of the canonical 3D positions $\mu_{c}$ , we extract feature embeddings $f(\mu_{c})$ from a multi-resolution triplane representation (Chan et al., 2022; Fridovich-Keil et al., 2023; Cao and Johnson, 2023). These feature embeddings are utilized to calculate the scale $s_{c}$ , rotation $r_{c}$ , spherical harmonics $SH_{c}$ , and opacity $\alpha_{c}$ of each point. These computed attributes make up the canonical 3D Gaussians of the talking head, denoted as:

(6)

\mathcal{G}_{\mathrm{can}}=\{\mu_{c},r_{c},s_{c},SH_{c},\alpha_{c}\}.

During training, instead of directly updating the 3D Gaussian attributes, the feature grids of the triplane and the attribute prediction networks are optimized. This allows for the feature embedding $f(\mu_{c})$ to store the region-specific facial information of the canonical 3D head, while also enforcing spatial relationships between neighboring Gaussians. In the following, we introduce the formulation of each module in detail.

4.2.1. Triplane representation for 3D Gaussian

In order to encode the spatial information of the canonical 3D head, we adopt a multi-resolution triplane representation, constructed from three orthogonal 2D feature grids, $P=\{P^{\mathrm{xy}},P^{\mathrm{yz}},P^{\mathrm{zx}}\}$ . Each of these planes has shape ${H\times R\times R}$ , where $H$ stands for the hidden dimension of features, and $R$ denotes the resolution of each dimension. For each 3D Gaussian with position $\mu$ , each of its coordinate values is normalized to $[0,R)$ , and its corresponding features are computed by interpolating the point into a regularly spaced 2D grid for each plane. These features are combined across the planes using the Hadamard product $\prod$ , followed by concatenation $\bigcup$ along the different dimensions, to produce a final feature vector $f(\mu)$ of length $H$ for each of the canonical Gaussian positions $\mu_{c}$ , as:

(7)

f(\mu)=\bigcup\prod_{p\in P}\mathrm{interp}\big(p,\zeta_{p}(\mu)\big),

where $\zeta_{p}(\mu)$ denotes a projection of $\mu$ onto the $p$ ’th plane and ‘ $\mathrm{interp}$ ’ denotes bilinear interpolation of a point into the regularly spaced 2D grid. The visualization of features in our multi-resolution triplane is depicted in Fig. 3.
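The lookup in Eq. (7) can be sketched in plain Python as below. This is a minimal illustration under several assumptions that are not stated in the paper: positions are given in the unit cube and rescaled to each grid, the concatenation runs over resolution levels, and each plane is stored as a nested list of shape channels × R × R. The names `bilinear` and `triplane_feature` are hypothetical.

```python
import math

def bilinear(plane, u, v):
    """Interpolate a C x R x R grid at continuous grid coordinates (u, v)."""
    R = len(plane[0])
    u = min(max(u, 0.0), R - 1.0)
    v = min(max(v, 0.0), R - 1.0)
    i0, j0 = int(math.floor(u)), int(math.floor(v))
    i1, j1 = min(i0 + 1, R - 1), min(j0 + 1, R - 1)
    du, dv = u - i0, v - j0
    out = []
    for ch in plane:
        near = ch[i0][j0] * (1 - dv) + ch[i0][j1] * dv
        far = ch[i1][j0] * (1 - dv) + ch[i1][j1] * dv
        out.append(near * (1 - du) + far * du)
    return out

def triplane_feature(levels, mu):
    """levels: list of {'xy', 'yz', 'zx'} plane dicts, one per resolution.
    mu: position in the unit cube (assumed normalization)."""
    x, y, z = mu
    projections = {"xy": (x, y), "yz": (y, z), "zx": (z, x)}
    feature = []
    for planes in levels:
        product = None
        for name, (a, b) in projections.items():
            plane = planes[name]
            R = len(plane[0])
            sample = bilinear(plane, a * (R - 1), b * (R - 1))
            # Hadamard product of the three per-plane samples.
            product = sample if product is None else [p * s for p, s in zip(product, sample)]
        feature.extend(product)  # concatenate across resolution levels
    return feature
```

Because nearby positions fall into the same grid cells and share the same corner features, their interpolated embeddings vary smoothly, which is the mechanism the text relies on to couple neighboring Gaussians.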

4.2.2. Attribute prediction of canonical 3D Gaussians

Unlike the original 3DGS implementation shown in (2), we do not explicitly store the shape information $r$ and $s$ , or the appearance information $SH$ and $\alpha$ . Instead, these attributes are obtained from the corresponding feature representation $f(\mu)$ . Specifically, we employ a set of MLP layers, denoted as $\mathcal{F}_{\mathrm{can}}(\cdot)$ , to map the feature $f(\mu)$ to the mean scale $s_{c}$ , mean rotation $r_{c}$ , mean spherical harmonics $SH_{c}$ , and mean opacity value $\alpha_{c}$ , as:

(8)

\{s_{c},r_{c},SH_{c},\alpha_{c}\}={\mathcal{F}}_{\mathrm{can}}\big(f(\mu)\big).

Compared to the original 3DGS (Kerbl et al., 2023) where each Gaussian is optimized independently, our hybrid representation conditioned on an implicit feature volume enforces shared facial information between adjacent points.

4.3. Learning audio-driven deformation of 3D Gaussians

Previous works (Guo et al., 2021; Liu et al., 2022; Tang et al., 2022; Li et al., 2023) employ a conditional NeRF representation, wherein the 3D coordinates of the sampling point along each ray remain fixed, with only color and density conditioned to input audio. However, in order to fully benefit from the explicit representation of 3DGS, we choose to deform the 3D Gaussians, where we manipulate not only the appearance information but also the spatial positions and shape of each Gaussian primitive. While this can more accurately capture the constantly fluctuating 3D shape of the talking head, deformation of 3D Gaussians is a much more complex task compared to controlling a NeRF representation. The intricate nature of Gaussian primitives, coupled with their sheer quantity, presents significant challenges for deformation due to the extensive parameter space of 3D Gaussians. In addition, input audio does not impact the whole facial image uniformly, making it vital for the deformation module to understand how varying facial regions respond to audio conditions for authentic facial animation.

In order to model the relations between the dynamic features and the vast number of 3D Gaussians, we fuse the input speech audio $a_{n}$ with the encoded feature $f({\mu}_{c})$ in an attention mechanism to produce the audio-aware feature $h_{n}$ for the $n$ -th image frame. The deformation offsets of each Gaussian attribute for subsequent frames are directly conditioned on the feature $h_{n}$ . Finally, the deformed set of 3D Gaussians for the $n$ -th image frame is defined as:

(9)

\mathcal{G}_{\mathrm{deform},n}=\{{\mu}_{c}+\Delta\mu_{n},r_{c}+\Delta r_{n},s_{c}+\Delta s_{n},SH_{c}+\Delta SH_{n},{\alpha}_{c}+\Delta\alpha_{n}\},

where $\Delta\mu_{n},\Delta{s}_{n},\Delta{r}_{n},\Delta{SH}_{n},\Delta{\alpha_{n}}$ are the deformation offsets at the $n$ -th frame for 3D position, scale, rotation, spherical harmonics parameters, and opacity, respectively. The details of each module are introduced in the following paragraphs.

4.3.1. Spatial-audio cross-attention

Previous approaches to implementing region-aware audio conditioning, like ER-NeRF (Li et al., 2023), simply adjust the weights for the audio features at each 3D point through elementwise multiplication. However, this approach has a limitation: regardless of the diverse audio inputs in a dynamic scene, a particular static 3D point consistently maintains the same audio weight. This fails to acknowledge that a fixed 3D coordinate may not consistently correspond to the same facial region as the scene progresses. To address this issue and enhance the extraction of spatial-audio features, we introduce the spatial-audio cross-attention module, a cross-attention mechanism that merges the spatial feature embedding $f(\mu_{c})$ of the canonical 3D Gaussians with subsequent audio features, capturing how the input speech audio affects the movement of the 3D Gaussians. The spatial-audio cross-attention module comprises $L$ pairs of a cross-attention layer $\mathcal{T}_{CA}(\cdot)$ and a feed-forward layer $FFN(\cdot)$ , each with a skip connection. The module is formulated as:

(10)

z_{n}^{0}=f(\mu_{c}),

(11)

{z^{\prime}_{n}}^{l}=\mathcal{T}_{CA}(z^{l-1}_{n},a_{n})+z^{l-1}_{n},\quad l=1,\dots,L,

(12)

z^{l}_{n}=FFN({z^{\prime}_{n}}^{l})+{z^{\prime}_{n}}^{l},\quad l=1,\dots,L,

whereby the cross-attention between the spatial feature $f$ and the audio feature $a_{n}$ of the $n$ -th image frame is computed. As a result, the output feature $z^{L}_{n}$ successfully amalgamates audio features with the rich facial details captured by each 3D Gaussian. This cross-attention module offers a more nuanced and stable method of feature combination than simple concatenation or multiplication, as the module reforms the spatial-aware facial features with respect to the subsequent audio features, taking into account the dynamic variability inherent in each 3D Gaussian.
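The recursion in Eqs. (10)-(12) can be sketched for a single Gaussian feature as follows. The paper does not specify the number of heads, normalization layers, or the feed-forward activation at this point, so the single-head scaled dot-product attention, the ReLU feed-forward layer, and all parameter names below are assumptions made for illustration only.

```python
import math

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(z, tokens, Wq, Wk, Wv):
    """The Gaussian feature z is the query; condition tokens give keys and values."""
    q = matvec(Wq, z)
    keys = [matvec(Wk, t) for t in tokens]
    values = [matvec(Wv, t) for t in tokens]
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
    return out, weights

def ffn(x, W1, W2):
    hidden = [max(0.0, h) for h in matvec(W1, x)]  # ReLU is an assumption
    return matvec(W2, hidden)

def spatial_audio_block(z, tokens, layers):
    """Apply L (attention, FFN) pairs with skip connections, starting from z^0 = f(mu_c)."""
    for p in layers:
        attn, _ = cross_attention(z, tokens, p["Wq"], p["Wk"], p["Wv"])
        z_prime = [a + b for a, b in zip(attn, z)]                            # Eq. (11)
        z = [a + b for a, b in zip(ffn(z_prime, p["W1"], p["W2"]), z_prime)]  # Eq. (12)
    return z
```

Each Gaussian issues its own query, so the attention weights depend on the Gaussian's spatial feature and on the current conditioning tokens, rather than being fixed per 3D coordinate.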

4.3.2. Disentanglement of speech-related motion.

When synthesizing a talking head, the corresponding speech audio does not account for all the intricate and diverse facial movements. Subtle expressions like eye blinks and facial wrinkles, along with external factors such as hair movement and variations in lighting, do not directly correlate with the input speech audio. Therefore, it is crucial to separate the non-verbal motions and scene variations when mapping speech audio to the 3D Gaussian deformation. In this section, we address this challenge by introducing additional input conditions that capture non-verbal motions, allowing us to disentangle speech-related motion from the monocular video.

Following previous works (Tang et al., 2022; Li et al., 2023), we first apply explicit eye blinking control with the eye feature $e$ . Specifically, we employ AU45 from the Facial Action Coding System (Ekman and Friesen, 1978) to describe the degree of the eye blink, and utilize a sinusoidal positional encoding in order to match the input dimensions. Additionally, we integrate the camera viewpoint as an auxiliary input to disentangle non-verbal scene variations. While we formulate the framewise camera $\pi_{n}$ as facial viewpoints, the typical video is recorded with a static camera while the head undergoes continuous movement. Consequently, variations in the portrait image, such as hair displacement and lighting changes, occur independently of the speech audio. Hence, we employ a facial viewpoint embedding $\upsilon$ as an additional input condition to disentangle these non-auditory scene fluctuations. $\upsilon_{n}$ is an embedding vector obtained by passing the extrinsic camera pose $\pi_{n}$ through a small MLP so that it has the same dimensionality as the other inputs. Finally, we found that using a single null-vector ( $\emptyset$ ) for all frames promotes consistency as a global feature across video frames. We incorporate this null-vector as an additional input to our cross-attention network. Thus, we reformulate (11) as:

(13)

{z^{\prime}_{n}}^{l}=\mathcal{T}_{CA}(z^{l-1}_{n},\{a_{n},e_{n},\upsilon_{n},\emptyset\})+z^{l-1}_{n},\quad l=1,\dots,L.

In Fig. 4, we visualize the attention scores for each input in order to demonstrate the efficacy of disentangling audio-related motion. Further details on the network structure and visualization procedure are provided in the supplementary file.
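As a rough illustration of how the conditioning set $\{a_{n},e_{n},\upsilon_{n},\emptyset\}$ in Eq. (13) might be assembled, the sketch below lifts the scalar AU45 value to the token dimension with a sinusoidal encoding and stacks it with the other tokens. The frequency schedule ($2^{k}\pi$) and both function names are assumptions; the paper only states that a sinusoidal positional encoding is used to match the input dimensions.

```python
import math

def sinusoidal_encoding(value, dim):
    """Encode a scalar as [sin(2^0 pi v), cos(2^0 pi v), sin(2^1 pi v), ...]."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    out = []
    for k in range(dim // 2):
        angle = (2.0 ** k) * math.pi * value  # assumed frequency schedule
        out.extend([math.sin(angle), math.cos(angle)])
    return out

def build_condition_tokens(audio_feat, au45, view_embed, null_token):
    """Return the four tokens attended to in Eq. (13), all of equal length."""
    dim = len(audio_feat)
    eye_feat = sinusoidal_encoding(au45, dim)
    tokens = [audio_feat, eye_feat, view_embed, null_token]
    if any(len(t) != dim for t in tokens):
        raise ValueError("all condition tokens must share one dimension")
    return tokens
```

Here `view_embed` stands for the output of the small MLP applied to the camera pose, and `null_token` is the single vector shared by all frames.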

4.3.3. Audio-conditioned deformation of 3D Gaussian

The final deformation network takes the spatially aware audio feature of each 3D Gaussian and computes the deformation of its position, rotation, and scale, as well as of its appearance attributes. We define a set of MLP regressors $\mathcal{F}_{\mathrm{deform}}(\cdot)$ to predict the offset of each Gaussian attribute, as:

(14)

\{\Delta\mu_{n},\Delta{s}_{n},\Delta{r}_{n},\Delta{SH}_{n},\Delta{\alpha_{n}}\}={\mathcal{F}}_{\mathrm{deform}}(z^{L}_{n}).
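Eqs. (9) and (14) together amount to predicting one offset vector per attribute and adding it to the canonical value. The sketch below uses a single linear layer per attribute purely for brevity; the paper's regressors are MLPs whose depth and activations are not given here, and the dictionary layout is a hypothetical choice.

```python
def linear(weight, bias, x):
    """One affine layer; stands in for an attribute-specific MLP regressor."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def predict_offsets(z, heads):
    """Eq. (14): map the fused feature z^L_n to one offset vector per attribute."""
    return {name: linear(weight, bias, z) for name, (weight, bias) in heads.items()}

def deform(canonical, offsets):
    """Eq. (9): add each predicted offset to the matching canonical attribute."""
    return {name: [c + d for c, d in zip(values, offsets[name])]
            for name, values in canonical.items()}
```

A call such as `deform(canonical, predict_offsets(z, heads))` would then yield the per-frame attributes that are passed to the rasterizer.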

Table 1. Quantitative comparison under the self-driven setting. The top, second-best, and third-best results are shown in red, orange, and yellow, respectively.

Methods	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$	FID $\downarrow$	LMD $\downarrow$	AUE $\downarrow$	Sync $\uparrow$	CSIM $\uparrow$	Training Time $\downarrow$	FPS $\uparrow$
Ground Truth	N/A	$1$	$0$	$0$	$0$	$0$	$8.653$	$1$	N/A	N/A
Wav2Lip (Prajwal et al., 2020)	$30.461$	$0.911$	$0.024$	$33.074$	$4.458$	$1.761$	$9.606$	$0.887$	-	19
PC-AVS (Zhou et al., 2021)	$21.958$	$0.699$	$0.053$	$42.646$	$4.619$	$1.875$	$9.185$	$0.519$	-	32
AD-NeRF (Guo et al., 2021)	$30.341$	$0.906$	$0.026$	$20.243$	$5.692$	$2.331$	$4.939$	$0.908$	$13$ h	0.13
RAD-NeRF (Tang et al., 2022)	$30.703$	$0.915$	$0.026$	$26.238$	$3.142$	$2.196$	$5.757$	$0.911$	$3$ h	32
ER-NeRF (Li et al., 2023)	$31.673$	$0.919$	$0.014$	$19.829$	$3.003$	$1.974$	$5.976$	$0.922$	$1$ h	34
GaussianTalker^∗	$32.269$	$0.930$	$0.016$	$10.771$	$2.711$	$1.758$	$6.443$	$0.933$	$1$ h	121
GaussianTalker	$32.423$	$0.931$	$0.018$	$11.951$	$2.928$	$2.292$	$6.554$	$0.932$	$1.5$ h	98

4.4. Training

4.4.1. Stage-wise optimization

3DGS (Kerbl et al., 2023) showed that the quality of reconstruction is influenced by the initialization of 3D Gaussians. Similarly, the training of the deformation field should also be conducted using a proper initialization of the canonical facial shape. To this end, we employ a two-stage training approach.

In the first stage, the canonical stage, we reconstruct the mean shape of the talking face by optimizing the positions of the 3D Gaussians and the multi-resolution triplane. Instead of the conventional initialization using structure from motion (SFM) points, we opt to utilize the 3D coordinates of the mesh vertices obtained by fitting 3D morphable models. Note that the 3DMM fitting of each frame involves no extra preprocessing, as it is a necessary part of obtaining the camera parameters of the talking face and is widely adopted in NeRF-based talking face synthesis works (Guo et al., 2021; Tang et al., 2022; Li et al., 2023). The static image of the canonical talking head is rasterized via:

(15)

\hat{I}_{\mathrm{can}}=\mathcal{R}(\mathcal{G}_{\mathrm{can}};\pi_{n}).

This is followed by the deformation stage, where we optimize the whole network and thereby learn the cross-attention deformation network. For each frame, the dynamic talking head video frame is rendered as:

(16)

\hat{I}_{n}=\mathcal{R}(\mathcal{G}_{\mathrm{deform},n};\pi_{n}).

4.4.2. Loss Functions

For the canonical stage, which learns a static shape of the talking head, we follow the original 3DGS implementation (Kerbl et al., 2023) and utilize a combination of an L1 color loss $\mathcal{L}_{\mathrm{L1}}$ and a D-SSIM term $\mathcal{L}_{\mathrm{D-SSIM}}$ . Following previous audio-driven NeRF works (Guo et al., 2021; Tang et al., 2022; Li et al., 2023), we also utilize the LPIPS (Zhang et al., 2018) loss $\mathcal{L}_{\mathrm{lpips}}$ to capture sharp details. For a given input frame $I$ , the overall loss function of the canonical stage is denoted as $\mathcal{L}_{\mathrm{can}}=\mathcal{L}_{\mathrm{L1}}+\lambda_{\mathrm{lpips}}\mathcal{L}_{\mathrm{lpips}}+\lambda_{\mathrm{D-SSIM}}\mathcal{L}_{\mathrm{D-SSIM}}$ . During the deformation stage, we employ an additional loss function on the lip area of the talking head. Specifically, we apply a reconstruction loss to the image patch obtained by cropping the region where the lips are located, based on the facial landmarks (Bulat and Tzimiropoulos, 2017). Thus, the total loss function for the deformation stage can be formulated as $\mathcal{L}_{\mathrm{deform}}=\mathcal{L}_{\mathrm{can}}+\lambda_{\mathrm{lip}}\mathcal{L}_{\mathrm{lip}}$ . Note that the deformed 3D Gaussians are directly splatted onto the combined background and torso image in order to render the head with the background and torso, a common technique that prevents noise around the facial contours (Tang et al., 2022; Li et al., 2023). A more detailed explanation of this technique can be found in the supplementary file.
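The way the loss terms combine can be sketched as follows. Computing LPIPS and D-SSIM requires a pretrained network and a windowed statistic respectively, so this sketch takes them as precomputed numbers. The use of L1 for the lip reconstruction term, the landmark format (integer $(x, y)$ pixel coordinates), the `margin` parameter, and every weight value passed in are assumptions rather than the paper's settings.

```python
def l1_loss(pred, target):
    """Mean absolute error between two H x W x C images stored as nested lists."""
    total, count = 0.0, 0
    for row_p, row_t in zip(pred, target):
        for px_p, px_t in zip(row_p, row_t):
            for a, b in zip(px_p, px_t):
                total += abs(a - b)
                count += 1
    return total / count

def lip_box(landmarks, margin, height, width):
    """Bounding box (top, bottom, left, right) around (x, y) lip landmarks."""
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    top = max(0, min(ys) - margin)
    bottom = min(height, max(ys) + margin + 1)
    left = max(0, min(xs) - margin)
    right = min(width, max(xs) + margin + 1)
    return top, bottom, left, right

def crop(image, box):
    top, bottom, left, right = box
    return [row[left:right] for row in image[top:bottom]]

def deform_stage_loss(pred, target, lip_landmarks, lpips_value, dssim_value,
                      lam_lpips, lam_dssim, lam_lip, margin=0):
    """L_deform = L_can + lam_lip * L_lip, with L_can = L1 + lam * LPIPS + lam * D-SSIM."""
    height, width = len(target), len(target[0])
    loss_can = l1_loss(pred, target) + lam_lpips * lpips_value + lam_dssim * dssim_value
    box = lip_box(lip_landmarks, margin, height, width)
    loss_lip = l1_loss(crop(pred, box), crop(target, box))  # L1 on the patch is assumed
    return loss_can + lam_lip * loss_lip
```

Setting `lam_lip` to zero recovers the canonical-stage objective, which mirrors how the deformation stage extends the first stage rather than replacing it.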

5. Experiments

Table 2. Quantitative comparison under the cross-driven setting. We extract two audio clips from SynObama demo (Suwajanakorn et al., 2017) to drive each method and compare lip synchronization.

	Testset A			Testset B
Methods	Sync $\uparrow$	LMD $\downarrow$	AUE $\downarrow$	Sync $\uparrow$	LMD $\downarrow$	AUE $\downarrow$
Ground Truth	$7.850$	$0$	$0$	$6.976$	$0$	$0$
Wav2Lip (Prajwal et al., 2020)	$8.272$	$7.102$	$2.023$	$7.907$	$5.591$	$3.164$
PC-AVS (Zhou et al., 2021)	$8.408$	$7.731$	$2.212$	$7.592$	$6.230$	$3.123$
AD-NeRF (Guo et al., 2021)	$5.128$	$18.986$	$3.654$	$5.109$	$9.221$	$3.266$
RAD-NeRF (Tang et al., 2022)	$5.126$	$12.485$	$3.611$	$4.497$	$7.760$	$3.447$
ER-NeRF (Li et al., 2023)	$4.694$	$12.477$	$3.779$	$4.822$	$7.698$	$3.287$
GaussianTalker	$5.356$	$12.702$	$3.663$	$5.413$	$7.812$	$3.265$

5.1. Experimental Settings

5.1.1. Dataset and pre-processing.

For each target subject, we require several minutes of talking portrait video with a corresponding audio track for training. Specifically, the datasets are obtained from publicly released video datasets utilized in previous NeRF-based works (Guo et al., 2021; Liu et al., 2022; Shen et al., 2022; Ye et al., 2022), averaging 6,000 frames per video at 25 fps. We also perform experiments on selected video clips sourced from the HDTF dataset (Zhang et al., 2021). Each portrait video is cropped and resized to $512\times 512$ , apart from the Obama video, which has a resolution of $450\times 450$ . We split each video into train and test sets at a ratio of 10:1, following the pre-processing steps introduced in AD-NeRF (Guo et al., 2021).

5.1.2. Comparison baselines.

We evaluate GaussianTalker against recent NeRF-based approaches that address the same problem setting. We introduce two variants of our method: the full model, GaussianTalker, with $L=2$ cross-attention layers, and a lightweight version, GaussianTalker^∗, with $L=1$ layer. We use three models as baselines: AD-NeRF (Guo et al., 2021), RAD-NeRF (Tang et al., 2022), and ER-NeRF (Li et al., 2023). For a fair comparison, we implement each method using the torso part from the ground-truth frames. Additionally, we include the one-shot 2D talking head models Wav2Lip (Prajwal et al., 2020) and PC-AVS (Zhou et al., 2021) to broaden the comparison.

5.2. Quantitative Evaluation

5.2.1. Comparison settings and metrics.

Following previous works (Tang et al., 2022; Li et al., 2023), our comparisons are structured into two distinct settings: self-driven and cross-driven. In the self-driven setting, we evaluate the accuracy of head reconstruction for a particular identity using the test subset. We employ several reconstruction metrics including peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and learned perceptual image patch similarity (LPIPS). Notably, these metrics are exclusively measured on the facial region. We also measure realism of the reconstructed face using Fréchet Inception Distance (FID) (Heusel et al., 2017) and identity preservation of the animated video using Cosine Similarity of Identity Embedding (CSIM) (Huang et al., 2020).
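To illustrate what restricting a reconstruction metric to the facial region means in practice, the sketch below computes PSNR over masked pixels only. The flat pixel layout, the binary mask, and the function name `masked_psnr` are assumptions for illustration; how the facial region is obtained is not specified in this section.

```python
import math


def masked_psnr(pred, target, mask, max_val=255.0):
    """PSNR in dB over the pixels where mask is truthy (flat, equally long sequences)."""
    pairs = [(p, t) for p, t, m in zip(pred, target, mask) if m]
    mse = sum((p - t) ** 2 for p, t in pairs) / len(pairs)
    if mse == 0:
        return math.inf  # identical inside the mask
    return 10.0 * math.log10(max_val ** 2 / mse)


# The large error at the last pixel lies outside the mask and is ignored.
score = masked_psnr([10, 20, 30, 40], [10, 22, 30, 0], [1, 1, 1, 0])
```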

For the cross-driven setting, all methods are driven by entirely unrelated audio tracks to evaluate lip synchronization. The audio clips used in this setup were extracted from demos of SynObama (Suwajanakorn et al., 2017). Due to the absence of ground-truth images, we assess lip sync accuracy with landmark distance (LMD) and SyncNet confidence score (Sync). We also employ action units error (AUE) to measure the precision of facial movements. Finally, we compare the training time and frames-per-second (FPS) as measures to evaluate the efficiency of each method.
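As a concrete reading of the landmark distance, the sketch below averages the Euclidean distance between corresponding landmarks over all frames. Which landmarks are used, whether any normalization is applied, and the function name `landmark_distance` are assumptions; the section does not specify the exact protocol.

```python
import math


def landmark_distance(pred_seq, ref_seq):
    """Mean Euclidean distance over all frames and all (x, y) landmark pairs."""
    total, count = 0.0, 0
    for pred_frame, ref_frame in zip(pred_seq, ref_seq):
        for (px, py), (rx, ry) in zip(pred_frame, ref_frame):
            total += math.hypot(px - rx, py - ry)
            count += 1
    return total / count


# One frame with two landmarks: distances 0 and 5, so the mean is 2.5.
lmd = landmark_distance([[(0, 0), (3, 4)]], [[(0, 0), (0, 0)]])
```

Under this definition a sequence compared with itself scores 0, which matches the Ground Truth rows in the tables.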

5.2.2. Self-driven evaluation.

The self-driven evaluation results are presented in Tab. 1. Note that the PSNR, SSIM, and LPIPS scores of Wav2Lip (Prajwal et al., 2020) are not valid, as it takes ground-truth images as input. While the one-shot 2D-based methods, Wav2Lip and PC-AVS, generate results with high synchronization scores, they fall short in faithful reconstruction, with poor PSNR and LPIPS scores. Benefiting from the 3DGS representation, GaussianTalker achieves comparable image fidelity at significantly faster rendering speeds (over 120 FPS for GaussianTalker^∗). Our method also achieves the best scores on most metrics, including a higher Sync score than the NeRF-based baselines. These results show that our method can synthesize 3D heads with accurate lip synchronization at real-time rendering speeds.

5.2.3. Cross-driven evaluation.

Results in Table 2 showcase successful lip movement synthesis with general audio input. GaussianTalker achieves a higher Sync score than all NeRF-based baselines on both test sets, demonstrating its effectiveness in handling unseen audio for lip synchronization. These results highlight GaussianTalker’s ability to generate high-fidelity 3D heads with real-time rendering speeds and accurate lip synchronization even with diverse audio inputs.

5.3. Qualitative Evaluation

In Fig. 5, we showcase results from the self-driven and cross-driven experiments. We choose four key frames from each of the two experiment settings to compare reconstruction quality and lip-sync accuracy. While 2D-based methods (Wav2Lip, PC-AVS) excel in lip synchronization, they fall short of generating a faithful and consistent face when the head is rotated. AD-NeRF suffers from blurry reconstructions due to its lack of eye blink control. RAD-NeRF and ER-NeRF, while demonstrating improved facial consistency, can exhibit discrepancies in lip synchronization and fail to capture hair movement during head rotations.

In contrast, GaussianTalker generates photorealistic images with intricate details in non-rigid regions like eyes and wrinkles. Our spatial-audio attention module effectively disentangles audio-driven motions from scene variations, enabling precise control of mouth movements. This capability allows our model to capture hair movement realistically when the head rotates, leading to superior overall head reconstruction fidelity. In order to comprehensively visualize the efficacy of our proposed method, we provide the rendered videos in the supplementary file. The provided supplementary video demonstrates impressive lip synchronization capabilities and high fidelity head reconstruction with realistic motion.

5.4. Ablation Study

In this section, we provide ablation studies to validate the efficacy of the design choices of our model. We also show detailed visualizations of the generated results in the supplementary material for better comparison.

Table 3. Ablation study results comparing various attribute configurations for embedding canonical 3D Gaussian attributes.

Method	PSNR $\uparrow$	LPIPS $\downarrow$	FID $\downarrow$	LMD $\downarrow$	Sync $\uparrow$
Ground Truth	N/A	$0$	$0$	$0$	$8.935$
$s,r,SH,\alpha$	$33.195$	$0.016$	$9.976$	$2.873$	$6.927$
$SH,\alpha$	$33.299$	$0.014$	$9.808$	$2.891$	$6.853$
$r,s$	$33.056$	$0.016$	$11.775$	$2.873$	$6.892$
random init.	$33.040$	$0.017$	$11.915$	$2.996$	$6.543$

Table 4. Ablation study on selection of deformed attributes.

Method	PSNR $\uparrow$	LPIPS $\downarrow$	FID $\downarrow$	LMD $\downarrow$	Sync $\uparrow$
Ground Truth	N/A	$0$	$0$	$0$	$8.935$
$\Delta SH,\Delta\alpha$	$32.746$	$0.021$	$44.933$	$3.179$	$6.694$
$\Delta\mu,\Delta r,\Delta s$	$33.036$	$0.013$	$17.52$	$2.970$	$6.688$
$\Delta\mu,\Delta r,\Delta s,\Delta SH,\Delta\alpha$	$33.299$	$0.013$	$9.808$	$2.890$	$6.928$

Table 5. Ablation study on augmented input conditions.

Method	PSNR $\uparrow$	LPIPS $\downarrow$	FID $\downarrow$	LMD $\downarrow$	Sync $\uparrow$
Ground Truth	N/A	$0$	$0$	$0$	$8.935$
w/o null-vec	$32.997$	$0.014$	$9.908$	$2.933$	$6.698$
w/o eye feature	$32.826$	$0.015$	$10.060$	$2.902$	$6.911$
w/o viewpoint	$31.866$	$0.019$	$13.231$	$3.052$	$6.563$
All (Ours)	$33.299$	$0.014$	$9.809$	$2.891$	$6.928$

Table 6. Ablation study on the effectiveness of stage-wise training.

Method	iter.	PSNR $\uparrow$	LPIPS $\downarrow$	FID $\downarrow$	LMD $\downarrow$	Sync $\uparrow$
Ground Truth	-	N/A	$0$	$0$	$0$	$8.935$
w/o stage-wise	500	$26.063$	$0.072$	$66.629$	$3.446$	$1.348$
	1000	$26.478$	$0.064$	$56.890$	$3.344$	$4.007$
	5000	$32.676$	$0.016$	$14.026$	$2.971$	$6.602$
w/ stage-wise	500	$31.076$	$0.029$	$31.301$	$3.792$	$1.548$
	1000	$31.923$	$0.024$	$20.366$	$3.245$	$4.449$
	5000	$32.733$	$0.014$	$11.173$	$2.923$	$6.736$

5.4.1. Attribute conditions for triplane

Our proposed triplane encodes the facial information of the canonical 3D head learned by the 3D Gaussians. The mechanism also enforces spatial relationships between Gaussians for better deformation. In Tab. 3, we demonstrate the effectiveness of this approach with a quantitative ablation on which attributes are conditioned on the embedding $f(\mu_{c})$. We also provide results where all attributes are optimized separately following the original implementation, and the triplane is trained in the deformation stage. Utilizing only a subset of the Gaussian attributes yields lower performance in lip synchronization and precision. Removing the attribute conditions during training leads to a loss of the spatial information embedded in the triplane, resulting in a lack of facial cohesion at inference time.

5.4.2. Selection of deformed attributes.

A major challenge in manipulating the Gaussians is the sheer number of parameters that need to be controlled. While estimating offsets for only a subset of attributes could reduce the computational load, it may compromise overall fidelity due to the lack of control. To examine this, in Tab. 4, we investigate different selections of Gaussian attributes for deformation. Controlling only $SH$ and $\alpha$ makes the formulation similar to conditional NeRF-based works (Guo et al., 2021; Tang et al., 2022; Li et al., 2023). Because 3DGS is an explicit representation that specifies 3D positions and shapes, controlling only the appearance attributes leads to a loss of overall fidelity. Conversely, controlling only the attributes that make up the position and shape of the 3D Gaussians yields lower reconstruction accuracy. Deforming all Gaussian attributes is crucial for the highest fidelity and superior lip synchronization.

5.4.3. Disentanglement of audio-unrelated motion.

We also investigate the significance of the augmented conditions, namely the eye blink feature, the facial viewpoint, and the null-vector. We evaluate the influence of these additional conditions on image fidelity and lip synchronization by selectively removing them during training (Table 5). The lower reconstruction scores are attributed to the reduced lip-sync accuracy caused by the entanglement of verbal motion with scene variations unrelated to audio. In the supplementary material, we also visualize the attention scores of each comparison experiment for detailed analysis.

5.4.4. Stagewise optimization

In Tab. 6, we investigate the importance of employing a separate canonical stage. As a comparison, we optimize the whole architecture by training all modules simultaneously from scratch. While the final results show similar performance, optimizing the coarse facial geometry before training the deformation network leads to faster optimization of the whole pipeline.

6. Conclusion

In this work, we have proposed GaussianTalker, a novel framework for real-time pose-controllable 3D talking head synthesis, leveraging the 3D Gaussians for the head representation. Our method enables precise control over Gaussian primitives by conditioning features extracted from a multi-resolution triplane. Additionally, the integration of a spatial-audio cross-attention module facilitates the dynamic deformation of facial regions, allowing for nuanced adjustments based on audio cues and enhancing verbal motion disentanglement. Our method is distinguished from prior NeRF-based methods by its superior inference speed and high-fidelity results for out-of-domain audio tracks. The efficacy of our approach is validated by quantitative and qualitative analyses. We look forward to enriched user experiences, particularly in video game development, where real-time rendering capabilities of GaussianTalker promise to enhance interactive digital environments.

References

Athar et al. (2022) ShahRukh Athar, Zexiang Xu, Kalyan Sunkavalli, Eli Shechtman, and Zhixin Shu. 2022. Rignerf: Fully controllable neural 3d portraits. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition. 20364–20373.
Bulat and Tzimiropoulos (2017) Adrian Bulat and Georgios Tzimiropoulos. 2017. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In Proceedings of the IEEE international conference on computer vision.
Cao and Johnson (2023) Ang Cao and Justin Johnson. 2023. HexPlane: A Fast Representation for Dynamic Scenes. CVPR (2023).
Cen et al. (2023) Jiazhong Cen, Jiemin Fang, Chen Yang, Lingxi Xie, Xiaopeng Zhang, Wei Shen, and Qi Tian. 2023. Segment any 3d gaussians. arXiv preprint arXiv:2312.00860 (2023).
Chan et al. (2022) Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. 2022. Efficient Geometry-Aware 3D Generative Adversarial Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16123–16133.
Chen et al. (2019) Lele Chen, Ross K Maddox, Zhiyao Duan, and Chenliang Xu. 2019. Hierarchical Cross-Modal Talking Face Generation With Dynamic Pixel-Wise Loss. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 7832–7841.
Chen et al. (2023) Yufan Chen, Lizhen Wang, Qijing Li, Hongjiang Xiao, Shengping Zhang, Hongxun Yao, and Yebin Liu. 2023. Monogaussianavatar: Monocular gaussian point-based head avatar. arXiv preprint arXiv:2312.04558 (2023).
Deng et al. (2019) Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde Jia, and Xin Tong. 2019. Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops.
Dhamo et al. (2023) Helisa Dhamo, Yinyu Nie, Arthur Moreau, Jifei Song, Richard Shaw, Yiren Zhou, and Eduardo Pérez-Pellitero. 2023. Headgas: Real-time animatable head avatars via 3d gaussian splatting. arXiv preprint arXiv:2312.02902 (2023).
Ekman and Friesen (1978) Paul Ekman and Wallace V. Friesen. 1978. Facial Action Coding System: Manual. Palo Alto: Consulting Psychologists Press.
Fang et al. (2023) Jiemin Fang, Junjie Wang, Xiaopeng Zhang, Lingxi Xie, and Qi Tian. 2023. Gaussianeditor: Editing 3d gaussians delicately with text instructions. arXiv preprint arXiv:2311.16037 (2023).
Fang et al. (2022) Jiemin Fang, Taoran Yi, Xinggang Wang, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Matthias Nießner, and Qi Tian. 2022. Fast Dynamic Radiance Fields with Time-Aware Neural Voxels. In SIGGRAPH Asia 2022 Conference Papers.
Fridovich-Keil et al. (2023) Sara Fridovich-Keil, Giacomo Meanti, Frederik Warburg, Benjamin Recht, and Angjoo Kanazawa. 2023. K-Planes: Explicit Radiance Fields in Space, Time, and Appearance. arXiv preprint arXiv:2301.10241 (2023).
Gao et al. (2024) Lin Gao, Jie Yang, Bo-Tao Zhang, Jia-Mu Sun, Yu-Jie Yuan, Hongbo Fu, and Yu-Kun Lai. 2024. Mesh-based Gaussian Splatting for Real-time Large-scale Deformation. arXiv preprint arXiv:2402.04796 (2024).
Gao et al. (2022) Xuan Gao, Chenglai Zhong, Jun Xiang, Yang Hong, Yudong Guo, and Juyong Zhang. 2022. Reconstructing personalized semantic facial nerf models from monocular video. ACM Transactions on Graphics (TOG) 41, 6 (2022), 1–12.
Grassal et al. (2022) Philip-William Grassal, Malte Prinzler, Titus Leistner, Carsten Rother, Matthias Nießner, and Justus Thies. 2022. Neural head avatars from monocular rgb videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18653–18664.
Guo et al. (2021) Yudong Guo, Keyu Chen, Sen Liang, Yong-Jin Liu, Hujun Bao, and Juyong Zhang. 2021. AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5784–5794.
Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In NeurIPS.
Hu and Liu (2024) Shoukang Hu and Ziwei Liu. 2024. GauHuman: Articulated Gaussian Splatting from Monocular Human Videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Huang et al. (2020) Yuge Huang, Yuhan Wang, Ying Tai, Xiaoming Liu, Pengcheng Shen, Shaoxin Li, Jilin Li, and Feiyue Huang. 2020. CurricularFace: Adaptive Curriculum Learning Loss for Deep Face Recognition. In CVPR.
Jamaludin et al. (2019) Amir Jamaludin, Joon Son Chung, and Andrew Zisserman. 2019. You Said That?: Synthesising Talking Faces from Audio. International Journal of Computer Vision 127 (2019), 1767–1779.
Kerbl et al. (2023) Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 2023. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (ToG) 42, 4 (2023), 1–14.
Khakhulin et al. (2022) Taras Khakhulin, Vanessa Sklyarova, Victor Lempitsky, and Egor Zakharov. 2022. Realistic one-shot mesh-based head avatars. In European Conference on Computer Vision. Springer, 345–362.
Li et al. (2023) Jiahe Li, Jiawei Zhang, Xiao Bai, Jun Zhou, and Lin Gu. 2023. Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis. arXiv preprint arXiv:2307.09323 (2023).
Li et al. (2017) Tianye Li, Timo Bolkart, Michael J. Black, Hao Li, and Javier Romero. 2017. Learning a Model of Facial Shape and Expression from 4D Scans. ACM Trans. Graph. 36, 6, Article 194 (nov 2017), 17 pages.
Li et al. (2024) Zhe Li, Zerong Zheng, Lizhen Wang, and Yebin Liu. 2024. Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
Liu et al. (2022) Xian Liu, Yinghao Xu, Qianyi Wu, Hang Zhou, Wayne Wu, and Bolei Zhou. 2022. Semantic-Aware Implicit Neural Audio-Driven Video Portrait Generation. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVII. Springer, 106–125.
Liu et al. (2023) Yang Liu, Xiang Huang, Minghan Qin, Qinwei Lin, and Haoqian Wang. 2023. Animatable 3D Gaussian: Fast and High-Quality Reconstruction of Multiple Human Avatars. arXiv preprint arXiv:2311.16482 (2023).
Lu et al. (2021) Yuanxun Lu, Jinxiang Chai, and Xun Cao. 2021. Live Speech Portraits: Real-Time Photorealistic Talking-Head Animation. ACM Trans. Graph. 40, 6, Article 220 (dec 2021), 17 pages. https://doi.org/10.1145/3478513.3480484
Luiten et al. (2023) Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. 2023. Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis. arXiv:2308.09713 [cs.CV]
Mildenhall et al. (2020) Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. 2020. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV.
Müller et al. (2022) Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. 2022. Instant Neural Graphics Primitives with a Multiresolution Hash Encoding. ACM Transactions on Graphics (ToG) 41, 4 (2022), 1–15.
Prajwal et al. (2020) KR Prajwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, and CV Jawahar. 2020. A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild. In Proceedings of the 28th ACM International Conference on Multimedia. 484–492.
Qian et al. (2023) Shenhan Qian, Tobias Kirschstein, Liam Schoneveld, Davide Davoli, Simon Giebenhain, and Matthias Nießner. 2023. Gaussianavatars: Photorealistic head avatars with rigged 3d gaussians. arXiv preprint arXiv:2312.02069 (2023).
Shen et al. (2022) Shuai Shen, Wanhua Li, Zheng Zhu, Yueqi Duan, Jie Zhou, and Jiwen Lu. 2022. Learning Dynamic Facial Radiance Fields for Few-Shot Talking Head Synthesis. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XII. Springer, 666–682.
Song et al. (2022) Linsen Song, Wayne Wu, Chen Qian, Ran He, and Chen Change Loy. 2022. Everybody’s talkin’: Let me talk as you want. IEEE Transactions on Information Forensics and Security 17 (2022), 585–598.
Sun et al. (2021) Yasheng Sun, Hang Zhou, Ziwei Liu, and Hideki Koike. 2021. Speech2Talking-Face: Inferring and Driving a Face with Synchronized Audio-Visual Representation.. In IJCAI, Vol. 2. 4.
Suwajanakorn et al. (2017) Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. 2017. Synthesizing obama: learning lip sync from audio. ACM Transactions on Graphics (ToG) 36, 4 (2017), 1–13.
Tang et al. (2022) Jiaxiang Tang, Kaisiyuan Wang, Hang Zhou, Xiaokang Chen, Dongliang He, Tianshu Hu, Jingtuo Liu, Gang Zeng, and Jingdong Wang. 2022. Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition. arXiv preprint arXiv:2211.12368 (2022).
Thies et al. (2020) Justus Thies, Mohamed Elgharib, Ayush Tewari, Christian Theobalt, and Matthias Nießner. 2020. Neural Voice Puppetry: Audio-Driven Facial Reenactment. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVI 16. Springer, 716–731.
Wang et al. (2024) Jie Wang, Jiu-Cheng Xie, Xianyan Li, Feng Xu, Chi-Man Pun, and Hao Gao. 2024. GaussianHead: High-fidelity Head Avatars with Learnable Gaussian Derivation. arXiv:2312.01632 [cs.CV]
Wang et al. (2020) Kaisiyuan Wang, Qianyi Wu, Linsen Song, Zhuoqian Yang, Wayne Wu, Chen Qian, Ran He, Yu Qiao, and Chen Change Loy. 2020. MEAD: A Large-Scale Audio-Visual Dataset for Emotional Talking-Face Generation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI. Springer, 700–717.
Wiles et al. (2018) Olivia Wiles, A Sophia Koepke, and Andrew Zisserman. 2018. X2Face: A Network for Controlling Face Generation Using Images, Audio, and Pose Codes. In Computer Vision–ECCV 2018: 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XIII 15. Springer, 690–706.
Wu et al. (2023) Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 2023. 4D Gaussian Splatting for Real-Time Dynamic Scene Rendering. arXiv:2310.08528 [cs.CV]
Yang et al. (2023a) Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. 2023a. Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction. arXiv:2309.13101 [cs.CV]
Yang et al. (2023b) Zeyu Yang, Hongye Yang, Zijie Pan, Xiatian Zhu, and Li Zhang. 2023b. Real-time Photorealistic Dynamic Scene Representation and Rendering with 4D Gaussian Splatting. arXiv:2310.10642 [cs.CV]
Yao et al. (2022) Shunyu Yao, RuiZhe Zhong, Yichao Yan, Guangtao Zhai, and Xiaokang Yang. 2022. DFA-NeRF: Personalized Talking Head Generation via Disentangled Face Attributes Neural Rendering. arXiv preprint arXiv:2201.00791 (2022).
Ye et al. (2023) Zhenhui Ye, Jinzheng He, Ziyue Jiang, Rongjie Huang, Jiawei Huang, Jinglin Liu, Yi Ren, Xiang Yin, Zejun Ma, and Zhou Zhao. 2023. GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation. arXiv preprint arXiv:2305.00787 (2023).
Ye et al. (2022) Zhenhui Ye, Ziyue Jiang, Yi Ren, Jinglin Liu, Jinzheng He, and Zhou Zhao. 2022. GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis. In The Eleventh International Conference on Learning Representations.
Yifan et al. (2019) Wang Yifan, Felice Serena, Shihao Wu, Cengiz Öztireli, and Olga Sorkine-Hornung. 2019. Differentiable surface splatting for point-based geometry processing. ACM Transactions on Graphics (TOG) 38, 6 (2019), 1–14.
Yin et al. (2022) Fei Yin, Yong Zhang, Xiaodong Cun, Mingdeng Cao, Yanbo Fan, Xuan Wang, Qingyan Bai, Baoyuan Wu, Jue Wang, and Yujiu Yang. 2022. Styleheat: One-shot high-resolution editable talking face generation via pre-trained stylegan. In European conference on computer vision. Springer, 85–101.
Yu et al. (2020) Lingyun Yu, Jun Yu, Mengyan Li, and Qiang Ling. 2020. Multimodal inputs driven talking face generation with spatial–temporal dependency. IEEE Transactions on Circuits and Systems for Video Technology 31, 1 (2020), 203–216.
Zhang et al. (2018) Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the IEEE conference on computer vision and pattern recognition. 586–595.
Zhang et al. (2023) Wenxuan Zhang, Xiaodong Cun, Xuan Wang, Yong Zhang, Xi Shen, Yu Guo, Ying Shan, and Fei Wang. 2023. SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8652–8661.
Zhang et al. (2021) Zhimeng Zhang, Lincheng Li, Yu Ding, and Changjie Fan. 2021. Flow-Guided One-Shot Talking Face Generation With a High-Resolution Audio-Visual Dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
Zheng et al. (2022) Yufeng Zheng, Victoria Fernández Abrevaya, Marcel C Bühler, Xu Chen, Michael J Black, and Otmar Hilliges. 2022. Im avatar: Implicit morphable head avatars from videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13545–13555.
Zhou et al. (2021) Hang Zhou, Yasheng Sun, Wayne Wu, Chen Change Loy, Xiaogang Wang, and Ziwei Liu. 2021. Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 4176–4186.
Zhou et al. (2020) Yang Zhou, Xintong Han, Eli Shechtman, Jose Echevarria, Evangelos Kalogerakis, and Dingzeyu Li. 2020. MakeItTalk: speaker-aware talking-head animation. ACM Transactions On Graphics (TOG) 39, 6 (2020), 1–15.
Zwicker et al. (2001) Matthias Zwicker, Hanspeter Pfister, Jeroen Van Baar, and Markus Gross. 2001. Surface splatting. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques. 371–378.

Appendix

In the following, we describe the implementation details and further analyses of GaussianTalker. Specifically, we first introduce the details of our network design and hyperparameter settings in Sec. A. We also provide details of our analysis on the proposed method that was conducted in the main paper in Sec. B. In Sec. C, we validate our methodology with more qualitative results from our experiments, and also conduct a user study. Then, more ablation studies are conducted in Sec. D. To further demonstrate the robustness and effectiveness of our framework, we also provide a supplementary video (Sec. E). Finally, we discuss the limitations and ethical considerations of our research in Sec. F.

Appendix A Implementation Details

A.1. Network architecture

A.1.1. Multi-resolution Triplane.

Our multi-resolution triplane consists of three orthogonal feature grids with a hidden feature dimension of $H=64$ and a base resolution of $R=64$, which is further upsampled by a factor of 2.
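To make the layout concrete, the sketch below builds two resolution levels (64 and 128, reading "upsampled by 2" as one additional level at twice the base resolution) and looks up a feature for a 3D point by bilinear interpolation on the xy, yz, and zx planes. Combining the three planes by product and concatenating across levels follows common plane-based representations and is an assumption here, as are the function names, the normalization of coordinates to $[0,1]$, and the use of a single scalar channel in place of $H=64$ channels.

```python
def bilinear(plane, u, v):
    """Sample a 2D grid (list of rows) at normalized coordinates u, v in [0, 1]."""
    rows, cols = len(plane), len(plane[0])
    x, y = u * (cols - 1), v * (rows - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, cols - 1), min(y0 + 1, rows - 1)
    tx, ty = x - x0, y - y0
    top = plane[y0][x0] * (1 - tx) + plane[y0][x1] * tx
    bottom = plane[y1][x0] * (1 - tx) + plane[y1][x1] * tx
    return top * (1 - ty) + bottom * ty


def make_level(resolution, value):
    """One level: three constant-valued planes (a placeholder for learned features)."""
    grid = [[value] * resolution for _ in range(resolution)]
    return {"xy": grid, "yz": grid, "zx": grid}


def triplane_feature(levels, point):
    """Per-level product of the three plane samples, concatenated across levels."""
    x, y, z = point
    return [
        bilinear(p["xy"], x, y) * bilinear(p["yz"], y, z) * bilinear(p["zx"], z, x)
        for p in levels
    ]


levels = [make_level(64, 2.0), make_level(128, 3.0)]  # base R=64, then 2x
feature = triplane_feature(levels, (0.25, 0.5, 0.75))
```

Because nearby points sample overlapping grid cells, their features vary smoothly with position, which is the property that lets the embedding couple neighboring Gaussians.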

A.1.2. Canonical 3D Gaussian attribute predictor.

The network that predicts the attributes of the canonical 3D Gaussians is made up of MLPs: $\mathcal{F}_{\mathrm{can}}=\{\phi_{\mathrm{shared}},\phi_{r},\phi_{s},\phi_{SH},\phi_{\alpha}\}$. Specifically, a tiny MLP $\phi_{\mathrm{shared}}$ encodes the triplane embedding $f(\mu_{c})$ and outputs a shared feature $\kappa$ for all attributes. The subsequent MLP regressors map this feature to each 3D Gaussian attribute:

(17)

\begin{gathered}\kappa=\phi_{\mathrm{shared}}(f(\mu_{c})),\\ r_{c}=\phi_{r}(\kappa),\;s_{c}=\phi_{s}(\kappa),\;SH_{c}=\phi_{SH}(\kappa),\;\alpha_{c}=\phi_{\alpha}(\kappa).\end{gathered}

A.1.3. Deformation offset predictor.

Similar to $\mathcal{F}_{\mathrm{can}}$, the deformation prediction network $\mathcal{F}_{\mathrm{deform}}=\{\psi_{\mu},\psi_{r},\psi_{s},\psi_{SH},\psi_{\alpha}\}$, which estimates the deformation offsets of each Gaussian attribute for each frame, consists of several small MLP regressors. For the $n$-th frame, the final output embedding from the cross-attention module, $z^{L}_{n}$, is mapped to each attribute offset such that

(18)

\begin{gathered}\Delta\mu_{n}=\psi_{\mu}(z^{L}_{n}),\;\Delta{r}_{n}=\psi_{r}(z^{L}_{n}),\;\Delta{s}_{n}=\psi_{s}(z^{L}_{n}),\\ \Delta{SH}_{n}=\psi_{SH}(z^{L}_{n}),\;\Delta{\alpha}_{n}=\psi_{\alpha}(z^{L}_{n}).\end{gathered}
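The following sketch shows one way the offsets of Eq. (18) could be combined with the canonical attributes for a single Gaussian. It assumes the offsets are simply added to the canonical values, which is the natural reading of "offset" but is not stated in this appendix; the dictionary layout and the name `apply_offsets` are hypothetical. Attributes without an offset keep their canonical value, which mirrors the partial-deformation rows of Table 4.

```python
def apply_offsets(canonical, offsets):
    """Return deformed attributes: canonical value plus offset, per attribute."""
    deformed = {}
    for name, values in canonical.items():
        delta = offsets.get(name, [0.0] * len(values))  # no offset: keep canonical
        deformed[name] = [v + d for v, d in zip(values, delta)]
    return deformed


canonical = {"mu": [0.0, 1.0, 2.0], "alpha": [0.5]}
offsets_n = {"mu": [0.25, -0.5, 0.0]}  # frame n deforms only the position
deformed_n = apply_offsets(canonical, offsets_n)
```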

A.2. Hyperparameter Configuration

During the canonical stage, we conduct training over $8,000$ iterations for a specific identity. We set the weights for the loss functions as follows: $\lambda_{1}=0.8$ , $\lambda_{\mathrm{lpips}}=0.01$ , and $\lambda_{\mathrm{D-SSIM}}=0.2$ . The initial learning rate for the multi-resolution triplane is set to 0.0016, gradually decaying to 0.00016. Similarly, the learning rate for $\mathcal{F}_{\mathrm{can}}$ starts at 0.0001 and diminishes to 0.00001. We cap the maximum number of 3D Gaussians at 50,000, and we abstain from utilizing the opacity reset operation from the original implementation (Kerbl et al., 2023), as we found it does not yield discernible benefits in our experiments.

Subsequently, in the deformation stage, we proceed with training the network for 8,000 iterations. We maintain the same weighting scheme for the loss functions: $\lambda_{1}=0.8$ , $\lambda_{\mathrm{lpips}}=0.01$ , $\lambda_{\mathrm{D-SSIM}}=0.2$ , and $\lambda_{\mathrm{lip}}=0.8$ . All modules are trained with an initial learning rate of 0.0001, gradually decreasing to 0.00001.
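The text gives only the start and end values of each learning rate. As a sketch of one plausible schedule, the code below interpolates log-linearly (that is, decays exponentially) between the two values; the shape of the schedule and the name `decayed_lr` are assumptions, not something this appendix specifies.

```python
import math


def decayed_lr(step, lr_init, lr_final, max_steps):
    """Log-linear interpolation from lr_init (step 0) to lr_final (step max_steps)."""
    t = min(max(step / max_steps, 0.0), 1.0)
    return math.exp((1.0 - t) * math.log(lr_init) + t * math.log(lr_final))


# Triplane learning rate in the canonical stage: 0.0016 decaying to 0.00016.
lr_start = decayed_lr(0, 0.0016, 0.00016, 8000)
lr_mid = decayed_lr(4000, 0.0016, 0.00016, 8000)
lr_end = decayed_lr(8000, 0.0016, 0.00016, 8000)
```

With this schedule the rate at the halfway point is the geometric mean of the two endpoints, so most of the absolute decrease happens early in training.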

While our spatial-audio cross-attention module primarily employs $L=2$ cross-attention layers, our modified GaussianTalker^∗ with $L=1$ can achieve comparable results with even faster inference speeds.

Appendix B Additional Analysis

B.1. Splatting on the background image

Initially, our research followed the method outlined in the original implementation (Kerbl et al., 2023), where faces were generated on a white background. However, we encountered limitations with this approach. To render images containing only faces on a white background, corresponding ground truth images with similar characteristics were required, necessitating the use of a segmentation model. However, due to the inherent inaccuracies of the segmentation model, the obtained facial masks tended to encompass larger areas, including the background. Additionally, the disproportionate emphasis of loss terms such as SSIM and perceptual loss on imperfect facial contours relative to mouth and eye movements hindered the learning process.

As a solution, we opted to generate faces against ground-truth (GT) backgrounds instead. This approach allows the boundaries of Gaussian presence to be learned accurately by distributing the loss across the entire image. Similar to the preprocessing techniques employed in previous NeRF-based works (Guo et al., 2021; Tang et al., 2022; Li et al., 2023), we interpolated over the human form in the background image to create an image with the person removed. Faces were then rendered directly onto this image with Gaussian splatting, enabling comparison with GT videos. With this strategy, GaussianTalker is trained without the need for a facial mask, facilitating the faithful representation of intricate details such as hair.

B.2. Visualization of Attention

In our spatial-audio cross-attention module, the computation of the attention score is formalized by the following equation:

(19)

(\mathrm{A}_{n})^{l}=\frac{\mathrm{softmax}(qk^{\intercal}_{n})^{l}}{\sqrt{d_{k}}},

where $l$ denotes the index of $\{a_{n},e_{n},\upsilon_{n},\emptyset\}$ and $(\mathrm{A}_{n})^{l}$ corresponds to its calculated attention score. $\mathrm{A}_{n}$ denotes the concatenation of all $(\mathrm{A}_{n})^{l}$ , resulting in a shape of ${{B}\times{H}\times{N}\times{d_{k}}}$ , which respectively indicate batch size, number of heads, number of Gaussians, and number of features per Gaussian.
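For intuition, the sketch below computes attention weights for a single Gaussian and a single head over the four condition tokens $\{a_{n},e_{n},\upsilon_{n},\emptyset\}$. It follows the standard scaled dot-product convention, in which the $1/\sqrt{d_{k}}$ scaling is applied before the softmax; this differs in placement from Eq. (19) as printed and is an assumption about the intended computation. The function name and the toy vectors are hypothetical.

```python
import math


def attention_over_conditions(q, keys):
    """Softmax of scaled dot products between one query and each condition key."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


q = [1.0, 0.0, 0.0, 0.0]       # feature of one Gaussian
keys = [
    [4.0, 0.0, 0.0, 0.0],      # audio
    [0.0, 1.0, 0.0, 0.0],      # eye blink
    [0.0, 0.0, 1.0, 0.0],      # viewpoint
    [0.0, 0.0, 0.0, 0.0],      # null-vector
]
weights = attention_over_conditions(q, keys)
```

In this toy case the Gaussian attends mostly to the audio token, as would be expected for a Gaussian near the mouth; mapping such per-condition weights to colors gives the visualization described next.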

For each attention score $(\mathrm{A}_{n})^{l}$, we visualize the attention by mapping the score to RGB values, which yields an attention visualization color $c_{att}$ for each Gaussian. The overall visualization of attention is then computed as:

(20)

C=\sum_{i=1}c_{i}{\alpha}^{\prime}_{i}\prod_{j=1}^{i-1}(1-{\alpha}^{\prime}_{j}),

where $c_{i}$ represents the color associated with each Gaussian, determined by $c_{att}$ along the view direction. ${\alpha}_{i}^{\prime}$ is derived from the multiplication of the opacity $\alpha$ of the 3D Gaussian and its projected covariance $\Sigma^{\prime}$ . This mathematical formulation allows us to visually interpret the model’s focus within the generated representations, effectively highlighting the areas of greatest feature impact.
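Eq. (20) is the usual front-to-back alpha compositing rule, which can be sketched for a single pixel as follows. The function name `composite` is hypothetical, and the inputs are assumed to be already sorted by depth, with each $\alpha^{\prime}_{i}$ precomputed for the pixel.

```python
def composite(colors, alphas):
    """Front-to-back compositing: C = sum_i c_i * a_i * prod_{j<i} (1 - a_j)."""
    out = [0.0, 0.0, 0.0]
    transmittance = 1.0  # running product of (1 - a_j) over closer Gaussians
    for color, alpha in zip(colors, alphas):
        weight = alpha * transmittance
        out = [o + weight * c for o, c in zip(out, color)]
        transmittance *= 1.0 - alpha
    return out


# A half-opaque red Gaussian in front of a half-opaque green one.
pixel = composite([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], [0.5, 0.5])
print(pixel)  # prints [0.5, 0.25, 0.0]
```

Replacing the per-Gaussian color with $c_{att}$ therefore reuses the normal rendering path, and Gaussians hidden behind opaque ones contribute little to the visualization.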

B.3. Visualization of triplane

Fig. 3 of the main paper visualizes the PCA result of our multi-resolution triplane, showing the efficacy of using a triplane to embed Gaussian features. We perform PCA on each plane of dimensions ${H\times R\times R}$, linearly projecting the first dimension down to three principal components, which yields dimensions ${3\times R\times R}$. The resulting values are then normalized to the range $[0,255]$ and interpreted as RGB values. As a result, in all of the xy, yz, and zx planes, semantically close facial regions are consistently represented with similar colors.

Appendix C Additional Experiments

C.1. Additional qualitative experiments.

We present additional visualizations of generated keyframes from the comparison experiments in the self-driven and cross-driven settings in Fig. 6 and Fig. 7, respectively. These experiments showcase the stability of our method and its applicability to various identities.

C.2. User study

Following previous works (Tang et al., 2022; Li et al., 2023), we conducted a user study to better assess the visual quality of the generated talking head videos. Twenty-one participants aged 20 to 40 were recruited to evaluate the rendered results in the head reconstruction setting. For accurate judgments, we combined all generated videos into a single high-resolution video so that participants could observe all movements simultaneously. To ensure a fair comparison, we labeled each generated result with a number rather than the name of its method. Participants were asked to rate the generated portraits on three aspects: (1) Lip-sync Accuracy; (2) Video Realness; and (3) Image Quality. The results are shown in Tab. 7.

Table 7. User study results. Ratings are on a scale of 1 to 5; higher is better. The top, second-best, and third-best results are shown in red, orange, and yellow, respectively.

	self-driven			cross-driven
Methods	Lip-sync Accuracy	Image Quality	Video Realness	Lip-sync Accuracy	Image Quality	Video Realness
Wav2Lip (Prajwal et al., 2020)	$3.167$	$2.665$	$2.459$	$2.678$	$2.313$	$2.135$
PC-AVS (Zhou et al., 2021)	$2.625$	$1.896$	$1.921$	$1.958$	$1.292$	$1.229$
AD-NeRF (Guo et al., 2021)	$2.031$	$2.492$	$2.396$	$2.574$	$3.042$	$2.365$
RAD-NeRF (Tang et al., 2022)	$2.417$	$2.750$	$2.541$	$2.938$	$3.146$	$2.604$
ER-NeRF (Li et al., 2023)	$2.354$	$3.042$	$2.771$	$2.792$	$3.458$	$3.146$
GaussianTalker	$3.083$	$3.667$	$3.188$	$3.250$	$3.729$	$3.208$
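Because the color highlighting may not survive in a plain-text rendering of Tab. 7, the ranking can be recomputed from the reported values. The short Python sketch below does this; the dictionary layout and the helper name `top_three` are hypothetical, while the numbers are copied from the table.

```python
# Ratings copied from Tab. 7. Column order: self-driven (lip-sync accuracy,
# image quality, video realness), then cross-driven (same three criteria).
ratings = {
    "Wav2Lip":        [3.167, 2.665, 2.459, 2.678, 2.313, 2.135],
    "PC-AVS":         [2.625, 1.896, 1.921, 1.958, 1.292, 1.229],
    "AD-NeRF":        [2.031, 2.492, 2.396, 2.574, 3.042, 2.365],
    "RAD-NeRF":       [2.417, 2.750, 2.541, 2.938, 3.146, 2.604],
    "ER-NeRF":        [2.354, 3.042, 2.771, 2.792, 3.458, 3.146],
    "GaussianTalker": [3.083, 3.667, 3.188, 3.250, 3.729, 3.208],
}

columns = [
    "self/lip-sync", "self/image-quality", "self/realness",
    "cross/lip-sync", "cross/image-quality", "cross/realness",
]

def top_three(col):
    """Return the three best-rated methods for column index `col`."""
    ranked = sorted(ratings, key=lambda m: ratings[m][col], reverse=True)
    return ranked[:3]

for i, name in enumerate(columns):
    print(name, top_three(i))
```

From the values in the table, GaussianTalker ranks first in every column except self-driven lip-sync accuracy, where Wav2Lip is rated slightly higher.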

Appendix D Ablation studies

D.1. Initialization of $\mu_{c}$

Our study explores the impact of initialization on canonical 3D Gaussian optimization. In the default setting, we leverage a pre-optimized Basel Face Model (Deng et al., 2019) to obtain camera parameters during preprocessing. The optimized mesh vertices are then used to initialize the 3D positions $\mu_{c}$ of the 3D Gaussians.

To investigate the impact of the proposed 3DMM-based initialization, we conduct an ablation study by comparing it to random initialization from a sphere. In Fig. 8, we visually analyze the optimization process of the canonical stage under both initialization settings. Our experiments demonstrate that utilizing 3DMM-based initialization leads to faster convergence, attributed to the facial depth information encoded in the initialized points.
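The two initialization settings compared here can be sketched as follows. This is a minimal NumPy illustration, not our training code: the vertex array stands in for the fitted 3DMM mesh, and the random baseline assumes points drawn uniformly from the volume of a sphere, which is one possible reading of "random initialization from a sphere". The function names, the radius, and the toy vertex count are hypothetical.

```python
import numpy as np

def init_from_vertices(vertices):
    """Use mesh vertices, shape (V, 3), directly as initial positions mu_c."""
    return np.asarray(vertices, dtype=np.float32).copy()

def init_random_sphere(num_points, radius=1.0, seed=0):
    """Sample initial positions uniformly inside a sphere (assumed baseline)."""
    rng = np.random.default_rng(seed)
    # Normalized Gaussian samples give uniformly distributed directions.
    directions = rng.normal(size=(num_points, 3))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # The cube root makes the samples uniform in volume, not in radius.
    radii = radius * rng.random(num_points) ** (1.0 / 3.0)
    return (directions * radii[:, None]).astype(np.float32)

# Placeholder vertices standing in for the fitted face mesh.
toy_vertices = np.random.default_rng(1).normal(scale=0.1, size=(100, 3))
mu_mesh = init_from_vertices(toy_vertices)
mu_random = init_random_sphere(num_points=100, radius=1.0)
```

The mesh-based positions already lie near the facial surface, whereas the random positions carry no depth prior and must be moved into place during optimization.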

D.2. Selection of attributes inferred for triplane embeddings.

In Fig. 9, we support the quantitative comparison in the main paper by presenting key frames of the rendered results. Conditioning the triplane embeddings on structure information such as $r$ and $s$ tends to yield less accurate facial details, such as wrinkles in the facial muscles. In contrast, conditioning on the appearance information $SH$ and $\alpha$ produces accurate reconstructions of the canonical head, but the facial motion appears less dynamic than the ground truth and does not correlate well with the input speech audio.

D.3. Selection of deformed attributes.

We also provide qualitative comparisons from our ablation study on the selection of Gaussian attributes to be deformed. Using the same comparison settings as in Sec. 5.4.2, we visualize the rendered results in Fig. 10. Deforming only $SH$ and $\alpha$ yields blurry results with unrealistic deformations, while only manipulating

D.4. Disentanglement of audio-unrelated motion

Finally, we reinforce the insights drawn from the quantitative analysis in Section 5.4.3. We elucidate the disentanglement of speech-related motion in Fig. 11 by presenting visualizations of the attention scores for the input conditions across the ablation experiment settings. Notably, the attention scores of the input speech audio become more widely distributed across other facial regions, indicating inadequate disentanglement of speech-related motion when solely provided with speech as the input condition.

Appendix E Supplementary Video

To comprehensively visualize the efficacy of our proposed method in the domain of talking facial video synthesis, we prepared a supplementary video. This video encompasses the results and analysis of our experiments presented in the main paper and the supplementary document. We showcase talking head videos generated under both the self-driven and cross-driven settings and compare them with previous NeRF-based works (Guo et al., 2021; Tang et al., 2022; Li et al., 2023). We also demonstrate the effectiveness of our spatial-audio cross attention module by showing how the attention scores of each condition evolve as the scene progresses. Lastly, the video includes a set of ablation studies that systematically examine the impact of each component of our proposed method.

Appendix F Further Discussions

F.1. Ethical Considerations

Our goal with GaussianTalker is to create realistic talking 3D heads for practical real-world applications like digital assistants and video production. However, its photorealism raises ethical concerns, as it’s difficult to distinguish real from synthetic videos. This can be used to create deepfakes, which are manipulated videos that can be used to spread misinformation or damage someone’s reputation. To address this, we propose several measures: 1) informing users about video authenticity, 2) sharing our results with deepfake detection communities to improve detection algorithms, and 3) advocating for digital watermarks in real videos to deter misuse. Finally, we believe responsible use requires clear regulations to govern deepfakes on social media, protecting users from potential manipulation.

F.2. Limitations and future work

GaussianTalker shares a common limitation with previous NeRF-based talking head synthesis methods: per-identity training. This restricts the model’s ability to generalize to new identities, making data preparation for audio and eye features time-consuming. Additionally, free-viewpoint rendering remains a challenge due to the lack of multi-view training data. While the deformation stage achieves high fidelity and generalizes well to out-of-domain audio, it struggles with extreme viewpoints. Our current approach uses limited canonical training for coarse structure, leading to inconsistencies when synthesizing from very different angles.

Future work will focus on overcoming these limitations. We aim to explore techniques for multi-identity training and efficient data pre-processing. Additionally, we will investigate methods for free-viewpoint rendering using techniques like multi-view data acquisition or neural rendering approaches.