Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation

Jiadong Liang, Feng Lu. State Key Laboratory of VR Technology and Systems, School of CSE, Beihang University, Beijing, China. ljdtc, [email protected]
Abstract.

Vivid talking face generation has immense potential for applications across diverse multimedia domains, such as film and game production. While existing methods accurately synchronize lip movements with input audio, they typically ignore crucial alignments between emotion and facial cues, which include expression, gaze, and head pose. These alignments are indispensable for synthesizing realistic videos. To address this issue, we propose a two-stage audio-driven talking face generation framework that employs 3D facial landmarks as intermediate variables. This framework achieves collaborative alignment of expression, gaze, and pose with emotions through self-supervised learning. Specifically, we decompose this task into two key steps, namely speech-to-landmarks synthesis and landmarks-to-face generation. The first step focuses on simultaneously synthesizing emotionally aligned facial cues, including normalized landmarks that represent expressions, gaze, and head pose. These cues are subsequently reassembled into relocated facial landmarks. In the second step, these relocated landmarks are mapped to latent keypoints using self-supervised learning and then input into a pretrained model to create high-quality face images. Extensive experiments on the MEAD dataset demonstrate that our model significantly advances the state of the art in both visual quality and emotional alignment.

Digital Human, Talking Face Generation, AIGC
CCS Concepts: Computing methodologies → Computer vision
Figure 1. We propose an advanced two-step framework for synthesizing vivid emotional talking faces with emotionally aligned facial cues. Initially, our approach utilizes the provided driving audio and an emotion label to generate three sequences of fine-grained facial cues (expression, head pose, and gaze) tailored to the specified emotion. Subsequently, these facial cues align with the specified emotion through self-supervised learning. Finally, utilizing these facial cues, we can produce vivid emotional talking face videos.

1. Introduction

Table 1. A comparison of related works across various criteria is shown on the left.
Methods (columns): Ours, Wav2Lip, MakeItTalk, PC-AVS, Audio2Head, MEAD, EVP, EAMM, EMMN, Xu, PD-FGC, SPACE
Facial cues (rows): Landmarks, Head Pose, Gaze, Emotion

The task of talking face generation involves creating a video of a talking face using a still identity image of the speaker and an audio track of their speech content. Furthermore, the generation of emotional talking face videos, featuring precise lip synchronization and vivid facial cues such as expressions, gaze, and head pose, holds considerable potential for future applications. We found that these facial cue sequences exhibit consistent patterns across videos corresponding to specific emotions. For instance, in a talking face depicting contempt, individuals typically narrow their eyes, tilt their heads upward, and shift their gaze horizontally. Conversely, in videos portraying surprise, individuals generally widen their eyes while maintaining a forward-facing head pose and gaze. The alignment of these facial cues with emotions is crucial for synthesizing realistic talking face videos. Therefore, the primary challenge of this task is to produce realistic facial videos that not only accurately reproduce lip movements in response to the driving audio but also maintain consistency between the various facial cues and the corresponding emotions.

With the development of AI-Generated Content (AIGC), numerous advanced methods for generating emotional talking faces have emerged. These methods can be mainly divided into audio-driven talking face generation (Wang et al., 2020; Ji et al., 2022; Gururani et al., 2023) and video-driven talking face generation (Ji et al., 2021; Wang et al., 2023a). Audio-driven talking face generation synthesizes a sequence of portrait images by using both an identity image and the corresponding audio as inputs. However, most existing works focus exclusively on lip movement and often lack the ability to generate associated facial cues. The latest work, SPACE (Gururani et al., 2023), is a multi-stage generative framework that achieves fine-grained control over facial expressions, emotion categories, head poses, and gaze directions of generated faces by manipulating the intermediate landmarks. Unfortunately, in SPACE, the control and generation of gaze and head pose sequences are unrelated to emotion. Although the generated faces display high visual quality, the final videos lack vividness and fail to differentiate emotions effectively. Video-driven talking face generation involves using a single identity image and multiple driving source videos as inputs. By employing a contrastive learning strategy, relevant features are extracted from various source videos to generate high-quality target talking face videos. These methods allow for the editing of facial attributes, such as head pose, facial expression, gaze, blinking, and audio, by modifying the corresponding source videos. However, these methods are not only expensive in the training stage but also incur additional costs in the inference stage due to the need for source video collection. We find that the fine-grained facial cues of a talking face video, such as expression, gaze, and head pose, are closely related to emotional categories. Independently controlling these facial cues without considering emotion can lead to unrealistic results. Therefore, we require a method that combines the low cost of audio-driven methods with a high degree of alignment between facial cues and emotion to synthesize vivid emotional talking face videos.

In this paper, we propose an advanced audio-driven method to synthesize vivid emotional talking faces with emotionally aligned facial cues. We decompose the task into two sub-tasks: speech-to-landmarks synthesis and landmarks-to-face generation. Given input speech, an emotion label, and an identity image, the proposed speech-to-landmarks module generates sequences of normalized facial landmarks (representing expressions), gaze, and head pose in an auto-regressive manner. To address the issue of substantial variations in gaze, we discretize the eye region and model gaze prediction as a classification task, successfully predicting gaze sequences for the first time. The alignment of emotional labels with these facial cues is achieved through self-supervised learning. These generated facial cues are synthesized into relocated 3D facial landmarks through coordinate correction and rotation. The relocated facial landmarks, driven by these cues, not only synchronize with the input audio but also enhance the alignment of expression, gaze, and pose movements with the corresponding emotion labels. We also build a collaborative emotion classifier to model the intrinsic relationships among these facial cues. Specifically, the classifier takes aggregated intermediate features from the facial cues as input, ensuring consistency among these facial cue sequences. The proposed landmarks-to-face module utilizes a latent keypoint space (Siarohin et al., 2021), capable of producing more realistic faces compared to traditional facial landmarks. Specifically, this module maps the relocated landmarks, generated by the speech-to-landmarks module, to latent keypoints. It then employs a pre-trained generator (Wang et al., 2021b) to synthesize high-quality facial images.

To the best of our knowledge, this is the first attempt to align normalized facial landmarks, gaze, and head pose concurrently with emotional categories. Compared to existing works, these explicitly aligned facial cues significantly improve the intensity and accuracy of emotional expressions in generated talking faces. Additionally, other contributions are as follows:

  • We introduce an auto-regressive speech-to-landmarks synthesis algorithm that generates emotionally aligned facial cues, including normalized facial landmarks, gaze, and head pose.

  • We designed a specific eye region discretization strategy that efficiently generates gaze sequences by modeling gaze prediction as a classification task.

  • Extensive experiments on the MEAD (Wang et al., 2020) dataset demonstrate the superiority of the proposed method in terms of lip synchronization and the quality of generated faces.

2. Related Work

In the following section, we provide an overview of prior research on audio-driven talking face generation and video-driven talking face generation. Table 1 outlines a summary of the key differences between our approach and other state-of-the-art methods. Specifically, a solid circle indicates that the method can automatically generate the relevant attribute or be driven by other source videos to control the relevant attribute. A half-filled circle denotes that the related attribute can only be controlled by other source videos. Conversely, an empty circle represents that the method cannot generate or control the related attribute.

Audio-driven Talking Face Generation. The objective of audio-driven talking face generation (Karras et al., 2017; Taylor et al., 2017; Wang et al., 2021a; Zhou et al., 2019; Vougioukas et al., 2019; Liu et al., 2022; Wang et al., 2023b; Shen et al., 2023; Fan et al., 2022; Du et al., 2023; Wu et al., 2023) is to establish a mapping from the input speech to facial representations. MakeItTalk (Zhou et al., 2020) disentangles audio content and speaker information to control lip motion and facial expressions, and works effectively with various portrait styles. Audio2Head (Wang et al., 2021a) addresses challenges in achieving natural head motion by employing a motion-aware RNN for head pose prediction. With the further development of this field, there is a growing emphasis on controlling facial emotions in generated talking faces. Wang et al. collected the MEAD (Wang et al., 2020) dataset and proposed a method that conditions talking head generation on emotion labels. However, MEAD primarily focuses on controlling only the mouth region while leaving other parts unchanged, leading to a lack of continuity in the generated videos. Compared to MEAD, which uses a single emotion label as input to control the emotion category of the generated video, EAMM (Ji et al., 2022) achieves precise emotional control over the synthesized video by adopting features extracted from an emotion source video. However, the method still ignores the movement of the gaze direction and head pose, which results in less realistic generated videos. EMMN (Tan et al., 2023) and Xu et al. (2023) employ memory networks and textual prompts, respectively, to control the emotions in the generated videos. SPACE (Gururani et al., 2023) achieves fine-grained control over facial expressions, emotion categories, head poses, and gaze directions of generated faces by decomposing the generation task into multiple subtasks. While these methods are user-friendly and straightforward, they often struggle to generate emotionally aligned facial cues for generated videos.

Video-driven Talking Face Generation. The goal of video-driven talking face generation (Drobyshev et al., 2022; Gu et al., 2020; Hong et al., 2022; Wang et al., 2022; Burkov et al., 2020; Liu et al., 2021; Xiang et al., 2020) is to accurately map facial movements from a source video onto a target image. Wav2Lip (Prajwal et al., 2020) constructs an expert discriminator to ensure precise alignment between the lip movements and the input audio. PC-AVS (Zhou et al., 2021) utilizes an implicit low-dimensional pose code to separate audio-visual representations and achieve accurate lip-syncing and pose control. EVP (Ji et al., 2021) decomposes speech into two decoupled spaces to generate dynamic 2D emotional facial landmarks. PD-FGC (Wang et al., 2023a) allows for the editing of facial attributes, such as head pose, facial expression, gaze, and blinking, by modifying the corresponding source videos. These methods are not only costly during the training stage, but also entail additional expenses during the inference stage due to the requirement of source video collection.

Figure 2. Architecture of the proposed method, which performs emotional talking face generation in two steps. In Step 1, we innovatively achieved the simultaneous generation of facial cue sequences, including normalized landmarks, gaze, and head pose. These cues are then aligned with emotional labels via self-supervised learning. In Step 2, we utilize the emotionally aligned facial cues from Step 1 as inputs, employing a pre-trained model to produce vivid emotional talking face videos.

3. Two-stage Talking Face Generation

Figure 3. Pipeline of gaze direction discretization.

The proposed method takes the driving audio and a reference image, along with an emotion label, and produces an emotional talking face. It decomposes this task into two steps: (1) Speech-to-Landmarks Synthesis: Given a reference image, it extracts normalized landmarks, gaze labels, and head poses, and predicts their per-frame motions driven by the input speech and emotion label. Specifically, our proposed method innovatively generates cohesive sequences of normalized facial landmarks, gaze, and head pose simultaneously. Furthermore, we accomplish the collaborative alignment of these facial cues with the corresponding emotional labels using self-supervised learning. (2) Landmarks-to-Face Generation: In this step, the per-frame relocated facial landmarks are mapped to latent keypoints, which are then fed into the pre-trained model (Wang et al., 2021b) to generate the final emotional talking face. This decomposition offers multiple advantages. Firstly, it enables fine-grained control over the output facial expressions. Secondly, the two-stage training approach can reduce the training complexity and accelerate the convergence of each module. By leveraging a pretrained face generator, we can effectively reduce the training cost while obtaining high-quality emotional talking faces. The overall framework is shown in Fig. 2.

3.1. Speech-to-Landmarks Synthesis.

Given input speech, an emotion label, and an identity face image, the proposed speech-to-landmarks synthesis generates sequences of normalized facial landmarks (representing expressions), gaze, and head pose in an auto-regressive manner. These facial cues are further aligned with specific emotions through self-supervised learning. Specifically, the network for speech-to-landmarks synthesis consists of three modules: Landmark Sequentializer, Gaze Sequentializer, and Pose Sequentializer, each performing auto-regressive prediction on a different facial cue sequence.
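The auto-regressive scheme shared by the three sequentializers can be illustrated with a minimal sketch. The `rollout` helper and the `toy_step` function below are hypothetical names introduced only for illustration; `toy_step` is a stand-in for a trained network, not the model described in this paper. The sketch only shows how each frame's prediction is conditioned on the previous prediction, the current audio features, and a fixed emotion embedding.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def rollout(
    step: Callable[[Vector, Vector, Vector], Vector],
    initial_state: Vector,
    audio_features: Sequence[Vector],
    emotion_embedding: Vector,
) -> List[Vector]:
    """Generate one facial-cue sequence frame by frame.

    The state predicted at frame n-1 is fed back as input for frame n,
    together with the audio features of frame n and the emotion embedding.
    """
    states: List[Vector] = []
    prev = initial_state
    for a_n in audio_features:
        prev = step(prev, a_n, emotion_embedding)
        states.append(prev)
    return states


def toy_step(prev: Vector, a_n: Vector, e: Vector) -> Vector:
    # Hypothetical stand-in for a trained sequentializer (assumption):
    # a fixed linear mix of the previous state, the audio, and the emotion.
    return [0.5 * p + 0.1 * sum(a_n) + 0.01 * sum(e) for p in prev]


if __name__ == "__main__":
    frames = rollout(toy_step, [0.0, 0.0], [[1.0], [1.0], [1.0]], [0.0])
    print(len(frames), frames[-1])
```

In this sketch the initial state would correspond to the cue extracted from the reference image, and one such loop would run per cue type.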

Landmark Sequentializer. In talking face generation, the quality of facial landmark generation is paramount, as these landmarks directly control lip movements and expressions. Given the MFCC features of the input audio and the normalized 3D facial landmarks of the input image, the Landmark Sequentializer predicts the normalized landmarks frame by frame. This is represented by

(1) $\mathbf{f}_{n}^{l}=\mathcal{S}_{\text{landmark}}(\mathbf{C}_{n-1},\mathbf{a}_{n},\mathbf{e}),\quad\mathbf{C}_{n}=\text{Linear}(\mathbf{f}_{n}^{l}),$

where $\mathbf{f}_{n}^{l}$ and $\mathbf{a}_{n}$ are the facial landmark features and audio features at time $n$, respectively. $\mathcal{S}_{\text{landmark}}$ represents the auto-regressive generator, and $\mathbf{C}_{n}\in\mathbb{R}^{147\times 3}$ is the normalized 3D facial landmarks at time $n$. The input emotion label embedding $\mathbf{e}\in\mathbb{R}^{D}$ corresponds to the indexed vectorial representation in the embedding matrix $\mathbf{E}\in\mathbb{R}^{D\times K}$ for an emotion dictionary containing $K$ emotion categories, which is learned together with the whole model. We employ a convolutional neural network (CNN) to encode the audio data, while a multi-layer perceptron (MLP) is used to encode the 3D facial landmarks. The network is trained utilizing an L1 loss function, which reduces the L1 distance between the predicted facial landmarks and the normalized ground truth landmarks. Notably, we assign a higher loss scale to the y-axis to emphasize vertical motion errors during training (Gururani et al., 2023). Using facial landmarks as an intermediary representation is advantageous as it facilitates the explicit manipulation of facial features. For instance, by manipulating the landmarks of the eyes, it becomes possible to incorporate eye blinks into the face. We discovered that conducting predictions in the normalized space is crucial to simplify the mapping between phonemes and lip motions.
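The axis-weighted L1 objective described above can be sketched as follows. The function name `weighted_l1_loss` and the default weights `(1.0, 2.0, 1.0)` are assumptions made for illustration; the text only states that the y-axis receives a higher loss scale, not its value.

```python
from typing import Sequence

Landmarks = Sequence[Sequence[float]]  # shape (num_points, 3): x, y, z


def weighted_l1_loss(
    pred: Landmarks,
    target: Landmarks,
    axis_weights: Sequence[float] = (1.0, 2.0, 1.0),  # assumed values
) -> float:
    """Mean absolute error over all coordinates, with a per-axis scale.

    A larger weight on the y-axis penalizes vertical errors (for example
    mouth opening) more strongly than horizontal or depth errors.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must contain the same number of points")
    total = 0.0
    for p, t in zip(pred, target):
        for w, pc, tc in zip(axis_weights, p, t):
            total += w * abs(pc - tc)
    return total / (len(pred) * 3)


if __name__ == "__main__":
    pred = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
    target = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
    print(weighted_l1_loss(pred, target))
```

With uniform weights the function reduces to a plain mean L1 distance, which makes the effect of the y-axis scale easy to compare.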

Pose Sequentializer. The natural movement of the head can effectively enhance the vividness of the generated video, but its motion pattern is also influenced by the emotional category. Hence, we also train an auto-regressive $\mathcal{S}_{\text{headpose}}$ that predicts the rotation and translation for the facial landmarks. The rotation is represented by three angles: yaw, pitch, and roll, and the corresponding translation is the displacement of the 3D landmarks along the x-axis, y-axis, and z-axis. The prediction is represented by

(2) $\mathbf{r}_{n}=\text{Linear}(\mathbf{f}_{n}^{r}),\quad\mathbf{r}_{n}=[\mathbf{m}_{n},\mathbf{b}_{n}],\quad\mathbf{f}_{n}^{r}=\mathcal{S}_{\text{headpose}}(\mathbf{r}_{n-1},\mathbf{a}_{n},\mathbf{e}),$

where $\mathbf{m}_{n}=[yaw,pitch,roll]$, $\mathbf{b}_{n}=[\Delta_{x},\Delta_{y},\Delta_{z}]$, and $\mathcal{S}_{\text{headpose}}$ represents the pose sequentializer. The poses, whether predicted or extracted from a reference video, are applied to the frontal normalized landmarks predicted by our Landmark Sequentializer. This transformation maps the normalized landmarks back to the image space after applying an appropriate scaling factor. The Pose Sequentializer is also trained utilizing an L1 loss.
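Applying a predicted pose to the normalized landmarks amounts to a rigid transform. The sketch below is a minimal illustration under stated assumptions: the angle convention (rotation composed as roll about z, then yaw about y, then pitch about x, in radians), the function names, and the order of scaling and translation are all choices made here, since the text does not specify them.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def matmul3(a: Matrix, b: Matrix) -> Matrix:
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_matrix(yaw: float, pitch: float, roll: float) -> Matrix:
    """Rotation R = Rz(roll) @ Ry(yaw) @ Rx(pitch); convention is assumed."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    rx = [[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]]
    ry = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]
    rz = [[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]]
    return matmul3(rz, matmul3(ry, rx))


def apply_pose(
    landmarks: Sequence[Sequence[float]],
    rotation: Sequence[float],     # m_n = [yaw, pitch, roll]
    translation: Sequence[float],  # b_n = [dx, dy, dz]
    scale: float = 1.0,            # assumed scaling back to image space
) -> List[List[float]]:
    """Rotate, scale, and translate every 3D landmark."""
    r = rotation_matrix(*rotation)
    posed = []
    for p in landmarks:
        rotated = [sum(r[i][k] * p[k] for k in range(3)) for i in range(3)]
        posed.append([scale * rotated[i] + translation[i] for i in range(3)])
    return posed


if __name__ == "__main__":
    print(apply_pose([[1.0, 0.0, 0.0]], (0.0, 0.0, math.pi / 2), (0.0, 0.0, 1.0)))
```

A zero pose leaves the landmarks unchanged, which is a convenient sanity check when integrating such a transform with detector outputs that may use a different angle convention.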

Gaze Sequentializer. The eyes, as important organs for human interaction, contain abundant information in gaze direction, which can impact the emotion category of generated videos. In contrast to the prediction of facial landmarks and head pose, we transform the prediction of gaze direction from regression to classification by discretizing the eye regions, as shown in Fig. 3. This effectively enhances the accuracy of gaze direction prediction. However, in the process of predicting the sequence of gaze directions, the current gaze direction heavily relies on the previous gaze.

Figure 4. Qualitative comparison of generated normalized landmarks between our method and three other methods on the MEAD dataset.

We adopt $\mathcal{S}_{\text{gaze}}$ to model this dependency relationship. Specifically, the gaze prediction of two eyes $[\mathrm{u}_{n}^{left}\in[0,S-1],\ \mathrm{u}_{n}^{right}\in[0,S-1]]$ at time $n$ can be modeled as:

(3) $\mathbf{f}_{n}^{g}=\mathcal{S}_{\text{gaze}}(\mathrm{v}_{n-1},\mathbf{a}_{n},\mathbf{e}),\quad\mathrm{v}_{n}=\mathrm{u}_{n}^{left}+S\times\mathrm{u}_{n}^{right}.$

Formally, the gaze decoder performs classification at the $n$-th time step based on the hidden states $\mathbf{f}_{n}^{g}$ by

(4) $\mathrm{v}_{n}=\underset{i\in[1,S\times S]}{\text{argmax}}\,(\mathbf{p}_{n}^{i}),\quad\mathbf{p}_{n}=\text{Softmax}(\mathbf{M}\mathbf{f}_{n}^{g}),$

where $\mathbf{M}$ denotes a linear transformation. $\mathbf{p}_{n}\in\mathbb{R}^{S\times S}$ represents the calculated probabilities for a total of $S\times S$ classification entries. $\mathrm{v}_{n}$ is the entry with the maximal probability, from which we can infer the corresponding gaze label $[\mathrm{u}_{n}^{left},\mathrm{u}_{n}^{right}]$. Finally, we utilize the cross-entropy loss to optimize the gaze direction prediction model.

(5) $\mathrm{Loss}_{gaze}=\frac{1}{N}\sum_{n=1}^{N}\mathrm{Loss}_{\text{CE}}(\mathbf{p}_{n},\hat{\mathbf{p}}_{n}).$
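The joint gaze label and its classification can be sketched as follows. This is an illustration only: all function names are hypothetical, class indices are assumed to be zero-based for both eyes (so the joint index runs from 0 to S×S-1, whereas the argmax in Eq. (4) numbers the entries from 1), and the logits stand in for the output of the linear transformation.

```python
import math
from typing import List, Sequence, Tuple


def encode_gaze(u_left: int, u_right: int, s: int) -> int:
    """Fold the two per-eye classes into one joint class v = u_left + S * u_right."""
    if not (0 <= u_left < s and 0 <= u_right < s):
        raise ValueError("each eye class must lie in [0, S-1]")
    return u_left + s * u_right


def decode_gaze(v: int, s: int) -> Tuple[int, int]:
    """Recover (u_left, u_right) from the joint class index."""
    u_right, u_left = divmod(v, s)
    return u_left, u_right


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def predict_gaze(logits: Sequence[float], s: int) -> Tuple[int, int]:
    """Pick the most probable of the S*S joint classes and decode it."""
    probs = softmax(logits)
    v = max(range(len(probs)), key=probs.__getitem__)
    return decode_gaze(v, s)


def cross_entropy(probs: Sequence[float], target_index: int) -> float:
    """Cross-entropy of one frame against a ground-truth joint class."""
    return -math.log(probs[target_index])


if __name__ == "__main__":
    S = 3
    logits = [0.0] * (S * S)
    logits[encode_gaze(2, 1, S)] = 10.0
    print(predict_gaze(logits, S))
```

Averaging `cross_entropy` over the frames of a sequence gives the per-sequence gaze loss in the form of Eq. (5).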

After obtaining these three types of facial cues, we integrate the gaze and head pose data into the normalized landmarks to obtain the relocated landmarks, as shown in Fig. 2. We also investigate the inherent relationships among these facial cues under a specific emotion. To this end, we construct a collaborative emotion classifier to encourage consistency among these facial cue sequences. The classifier predicts the emotion category using the aggregated intermediate features from the facial cues as input. Specifically, we can obtain the fake intermediate features $\hat{\mathbf{f}}_{n}=[\hat{\mathbf{f}}_{n}^{l};\hat{\mathbf{f}}_{n}^{r};\hat{\mathbf{f}}_{n}^{g}]$, where $[\hat{\mathbf{f}}_{n}^{l};\hat{\mathbf{f}}_{n}^{r};\hat{\mathbf{f}}_{n}^{g}]$ is predicted during the training stage.

(6) $\hat{\mathbf{l}}_{n}=\mathcal{F}_{classify}(\hat{\mathbf{f}}_{n}),\quad\mathrm{Loss}_{C}=\mathrm{L}_{\text{CE}}(\hat{\mathbf{l}}_{n},\mathbf{l}_{n}),$

where $\mathrm{L}_{\text{CE}}$ is the cross-entropy loss and $\hat{\mathbf{l}}_{n}$ is the calculated probabilities over all emotion classification entries. The total loss for the stage of speech-to-landmarks synthesis is

(7) $\mathrm{Loss}_{norm}=\mathrm{Loss}_{landmarks}+\mathrm{Loss}_{pose}+\mathrm{Loss}_{gaze}+\mathrm{Loss}_{C}.$
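A minimal sketch of the collaborative emotion classifier and the summed objective of Eq. (7) follows. The classifier is modeled here as a single linear layer followed by a softmax, which is an assumption: the text does not specify the architecture of the classifier, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def concat_features(f_l: Sequence[float], f_r: Sequence[float],
                    f_g: Sequence[float]) -> List[float]:
    """Aggregate landmark, pose, and gaze features into one vector."""
    return list(f_l) + list(f_r) + list(f_g)


def classify(features: Sequence[float],
             weights: Sequence[Sequence[float]],
             bias: Sequence[float]) -> List[float]:
    """Assumed classifier: one linear layer plus softmax over emotions."""
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def emotion_loss(probs: Sequence[float], label: int) -> float:
    """Cross-entropy between predicted emotion probabilities and the label."""
    return -math.log(probs[label])


def total_loss(loss_landmarks: float, loss_pose: float,
               loss_gaze: float, loss_c: float) -> float:
    """Unweighted sum of the four terms, as in Eq. (7)."""
    return loss_landmarks + loss_pose + loss_gaze + loss_c


if __name__ == "__main__":
    f = concat_features([1.0], [2.0], [3.0])
    probs = classify(f, [[0.0] * 3, [0.0] * 3], [0.0, 0.0])
    print(total_loss(0.1, 0.2, 0.3, emotion_loss(probs, 0)))
```

Because the classifier sees the concatenated features of all three cues, its loss gradient reaches every sequentializer, which is one way to read the consistency effect described above.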

Figure 5. Comparison of left eye gaze distribution between our model and state-of-the-art models across different emotion categories.

In fact, the generation of all facial cues incorporates emotional labels as input. By minimizing the discrepancy between the generated facial cues and the ground truth detected by off-the-shelf detectors (see Sec. 4.2), we achieve alignment of expressions, gazes, and head poses with emotion labels through self-supervised learning.

3.2. Landmarks-to-Face Generation.

The field of face generation has undergone rapid development, and we find that using latent keypoints as input can generate high-quality facial images. Following SPACE (Gururani et al., 2023), we utilize the pre-trained model face-vid2vid (Wang et al., 2021b), a state-of-the-art framework for generating faces from latent keypoints. This approach avoids the need to learn a facial image generator from scratch, thereby reducing computational requirements and improving efficiency. Specifically, we first use the relocated landmarks in 3D space generated in the speech-to-landmarks synthesis stage, along with the input speech and emotion labels, as input to perform auto-regressive prediction of latent keypoints in a self-supervised learning manner, as shown in Fig. 2.

Figure 6. Visualization of the head pose sequences in the pitch, yaw, and roll directions under different emotions.
(8) $\mathbf{R}_{n}=\mathcal{F}_{relocate}(\mathbf{C}_{n}\cdot\mathbf{m}_{n}+\mathbf{b}_{n},\mathrm{u}_{n}^{left},\mathrm{u}_{n}^{right}),\quad\mathbf{K}_{n}=\mathcal{S}_{\text{Key}}(\mathbf{K}_{n-1},\mathbf{R}_{n},\mathbf{a}_{n},\mathbf{e}),$

where $\mathcal{F}_{relocate}$ converts the normalized 3D facial landmarks $\mathbf{C}_{n}\in\mathbb{R}^{147\times 3}$ to relocated facial landmarks $\mathbf{R}_{n}\in\mathbb{R}^{147\times 3}$ using the gaze label $[\mathrm{u}_{n}^{left},\mathrm{u}_{n}^{right}]$ and the head pose $\mathbf{r}_{n}=[\mathbf{m}_{n},\mathbf{b}_{n}]$ obtained in the speech-to-landmarks stage. $\mathcal{S}_{Key}$ is the autoregressive generator for 3D keypoints. Then, given the predicted latent keypoints $\mathbf{K}_{n}\in\mathbb{R}^{10\times 3}$ and the initial face, the pre-trained model generates high-quality facial images by using a flow-based warping field as an intermediate variable.
Finally, we can generate high-quality facial images with a resolution of 256 × 256, which is generally superior to previous works. By breaking down the generation of 3D facial landmarks into the collaborative production of three facial cues, we simplify the task. Consequently, we select a lightweight Bi-LSTM as the backbone for all autoregressive generators ($\mathcal{S}_{lanmark}$, $\mathcal{S}_{headpose}$, $\mathcal{S}_{gaze}$, and $\mathcal{S}_{Key}$), enabling the generation of high-quality facial cues and 3D keypoints at low computational cost.
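
As a rough illustration of the two lines of Eq. (8), the sketch below applies the rigid part of the relocation and then rolls out keypoints one frame at a time. It rests on several assumptions that this section does not spell out: $\mathbf{m}_{n}$ is treated as a 3 × 3 rotation matrix applied to row vectors, $\mathbf{b}_{n}$ as a 3-vector translation, the gaze-dependent pupil placement inside $\mathcal{F}_{relocate}$ is omitted, and `step_fn` is a hypothetical stand-in for the Bi-LSTM generator $\mathcal{S}_{Key}$.

```python
from typing import Any, Callable, List, Sequence

Landmarks = List[List[float]]  # one [x, y, z] row per landmark


def relocate_rigid(C: Landmarks,
                   m: Sequence[Sequence[float]],
                   b: Sequence[float]) -> Landmarks:
    """Rigid part of the relocation: R = C . m + b.

    Assumption: m is a 3x3 matrix acting on row vectors and b is a
    3-vector. Gaze-dependent pupil placement is NOT modelled here.
    """
    return [
        [sum(p[k] * m[k][j] for k in range(3)) + b[j] for j in range(3)]
        for p in C
    ]


def rollout_keypoints(K0: Any,
                      R_seq: Sequence[Any],
                      a_seq: Sequence[Any],
                      e: Any,
                      step_fn: Callable[[Any, Any, Any, Any], Any]) -> List[Any]:
    """Autoregressive rollout: K_n = step_fn(K_{n-1}, R_n, a_n, e).

    step_fn is a hypothetical placeholder for the learned generator.
    """
    K_prev, outputs = K0, []
    for R_n, a_n in zip(R_seq, a_seq):
        K_prev = step_fn(K_prev, R_n, a_n, e)
        outputs.append(K_prev)
    return outputs
```

Each predicted $\mathbf{K}_{n}$ is fed back as the previous state for frame $n+1$, which is what makes the prediction autoregressive; in the sketch that feedback is the reassignment of `K_prev` inside the loop.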

This work fundamentally differs from SPACE (Gururani et al., 2023). SPACE focuses on precise control of the generated faces but neglects the alignment of related facial cues with emotions. Consequently, this oversight results in low differentiation among emotion categories and diminished vividness in the generated emotional talking faces. Conversely, our research prioritizes maintaining consistency between the facial cues and the driving emotion labels in generated emotional talking face videos. Furthermore, utilizing our proposed eye region discretization strategy, we have successfully generated gaze sequences for the first time.

4. Experiment

4.1. Implementation Details.

The videos were sampled at a rate of 30 frames per second (FPS), while the audio was pre-processed to 16 kHz. To extract audio features, we computed 28-dimensional MFCC using a window size of 30. We trained and evaluated our method using the MEAD dataset, an audio-visual emotional dataset comprising 60 actors/actresses and eight different emotion categories.

4.2. Dataset Preprocessing.

The proposed method adopts a self-supervised learning approach in both Speech-to-Landmarks Synthesis and Landmarks-to-Face Generation. To this end, we preprocess the emotional talking face videos to obtain training pairs by extracting the facial landmarks and latent keypoints for each frame. Variations in head pose within videos lead to a greater diversity of landmark movements, potentially reducing prediction accuracy, so we perform face normalization. Specifically, videos are aligned by centering on the nose tip and resized to a uniform resolution of 256 × 256 pixels, a data processing approach prevalent in related works (Chen et al., 2020, 2018; Zhou et al., 2019). Given a talking-head video, we first extract per-frame facial landmarks and head poses. We adopt the Mediapipe (Lugaresi et al., 2019) landmark detector to extract 478 3D facial landmarks from each frame. To reduce computational cost and speed up inference, we select 147 of these landmarks for training while retaining the landmarks in the eye area. The per-frame head pose is obtained with the 3DDFA (Guo et al., 2020) landmark detector. We then rotate the face so that the nose tip points straight toward the camera, aligned with the camera axis. Finally, the per-frame frontalized 3D facial landmarks are scaled to obtain normalized landmarks with the same facial width.
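
The normalization steps described above (center on the nose tip, undo the head rotation, rescale to a common facial width) can be sketched as follows. This is a minimal illustration, not the actual preprocessing code: the landmark indices `nose_idx`, `left_idx`, and `right_idx` are hypothetical parameters, the head pose is assumed to be available as a 3 × 3 rotation matrix acting on row vectors, and the target width of 1.0 is an arbitrary choice.

```python
import math
from typing import List, Sequence

Landmarks = List[List[float]]  # one [x, y, z] row per landmark


def normalize_landmarks(pts: Landmarks,
                        rot: Sequence[Sequence[float]],
                        nose_idx: int,
                        left_idx: int,
                        right_idx: int,
                        target_width: float = 1.0) -> Landmarks:
    """Center, frontalize, and rescale one frame of 3D landmarks."""
    # 1) Center on the nose tip.
    nose = pts[nose_idx]
    centered = [[p[j] - nose[j] for j in range(3)] for p in pts]

    # 2) Undo the head rotation. Assumption: observed = frontal @ rot,
    #    so frontal = observed @ rot^T (inverse of a rotation = transpose).
    frontal = [
        [sum(p[k] * rot[j][k] for k in range(3)) for j in range(3)]
        for p in centered
    ]

    # 3) Rescale so every face has the same width between two
    #    (hypothetical) reference landmarks.
    width = math.dist(frontal[left_idx], frontal[right_idx])
    if width == 0:
        raise ValueError("reference landmarks coincide; cannot scale")
    s = target_width / width
    return [[c * s for c in p] for p in frontal]
```

Because the rotation is removed before scaling, the facial width is measured on the frontalized face, matching the order of operations in the text.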

Most existing speech-driven talking face generation methods neglect the generation of gaze direction sequences, so that individuals in the generated videos keep looking straight ahead. This omission weakens the authenticity of the generated videos. To predict gaze direction more effectively, we discretize it and model the prediction as a classification problem rather than a regression problem. Specifically, we divide each eye area into 10 regions based on facial landmarks. During training, we directly predict the region into which the pupil will fall. The pipeline of gaze discretization is illustrated in Fig. 3. In addition to landmarks, we also extract latent keypoints per frame using the pretrained face-vid2vid encoder (Wang et al., 2021b). This provides per-frame pairs of (facial landmarks, latent keypoints).
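
To make the classification view of gaze concrete, the sketch below maps a pupil position to one of 10 discrete labels. The actual partition of the eye area is the one shown in Fig. 3; here we simply assume, for illustration only, a 2 × 5 grid over the bounding box of the eye landmarks and a 0-based label, neither of which is taken from the paper.

```python
from typing import Sequence, Tuple

Point2D = Tuple[float, float]


def gaze_region(pupil: Point2D,
                eye_landmarks: Sequence[Point2D],
                rows: int = 2,
                cols: int = 5) -> int:
    """Return the index (0 .. rows*cols - 1) of the cell containing the pupil.

    Assumption: the eye area is approximated by the axis-aligned bounding
    box of its landmarks and split into a rows x cols grid.
    """
    xs = [p[0] for p in eye_landmarks]
    ys = [p[1] for p in eye_landmarks]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    if x1 == x0 or y1 == y0:
        raise ValueError("degenerate eye bounding box")

    # Relative position of the pupil inside the box, nominally in [0, 1].
    u = (pupil[0] - x0) / (x1 - x0)
    v = (pupil[1] - y0) / (y1 - y0)

    # Clamp so pupils on or slightly outside the border still get a label.
    col = min(cols - 1, max(0, int(u * cols)))
    row = min(rows - 1, max(0, int(v * rows)))
    return row * cols + col
```

With labels of this kind, each gaze target is a class index, so the generator can be trained with a standard classification loss instead of regressing continuous pupil coordinates.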

4.3. Evaluation Metrics.

To evaluate the alignment between the generated face and the input audio, we calculate the Euclidean distance between the facial landmarks of the generated images and those of the ground truth images in the mouth region (MLD (Chen et al., 2019)). We also evaluate the accuracy of facial expressions by measuring the landmark distance over the whole face (FLD). We use the confidence scores of SyncNet (Chung and Zisserman, 2017) to evaluate the consistency between the generated face and the driving audio at the feature level. For the visual quality of the synthesized face, we use Structural Similarity (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Fréchet Inception Distance (FID) (Heusel et al., 2017). Additionally, we quantitatively assess the quality of gaze and head pose in talking face videos. Given the diversity of these facial cues, calculating their frame-by-frame differences from the ground truth is not meaningful. Therefore, we adopt Dynamic Time Warping (DTW) to measure the similarity between the generated facial cue sequences and the ground truth (GT). Specifically, for head pose, we separately calculate the DTW for the pitch, yaw, and roll sequences. For gaze, we calculate the DTW for the pupil movement speed sequences.
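
For readers unfamiliar with DTW, a minimal sketch of the classic dynamic-programming formulation is given below, together with a helper that turns pupil positions into a speed sequence. The absolute difference as local cost, the absence of any path-length normalization, and the helper name `speed_sequence` are our assumptions; the exact variant and normalization behind the reported scores are not specified in this section.

```python
import math
from typing import List, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic O(len(a) * len(b)) DTW with |x - y| as the local cost."""
    n, m = len(a), len(b)
    # D[i][j] = minimal accumulated cost of aligning a[:i] with b[:j].
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # a advances, b repeats
                                 D[i][j - 1],      # b advances, a repeats
                                 D[i - 1][j - 1])  # both advance
    return D[n][m]


def speed_sequence(positions: Sequence[Sequence[float]]) -> List[float]:
    """Per-frame pupil displacement between consecutive frames."""
    return [math.dist(p, q) for p, q in zip(positions, positions[1:])]
```

Unlike a frame-by-frame difference, DTW lets one sequence stretch or compress in time before costs are accumulated, so a generated head turn that happens a few frames earlier or later than in the GT is not penalized as heavily.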

Figure 7. Qualitative comparison of our model and other state-of-the-art models for emotional talking face generation.
Table 2. Performance of ablation study in DTW, including the absence of Gaze Sequentializer, denoted as Ours w/o Gaze, and the absence of Pose Sequentializer, denoted as Ours w/o Pose.
Method                      Pitch ↓   Yaw ↓   Roll ↓   Left eye ↓   Right eye ↓
MEAD (Wang et al., 2020)    0.62      0.32    0.53     0.85         0.91
EVP (Ji et al., 2021)       0.74      0.77    0.62     0.88         0.95
EAMM (Ji et al., 2022)      0.76      0.74    0.31     0.81         0.88
Ours w/o Gaze               0.58      0.38    0.33     0.83         0.89
Ours w/o Pose               0.70      0.65    0.42     0.68         0.71
Ours w/o C                  0.55      0.42    0.27     0.64         0.70
Ours                        0.53      0.31    0.18     0.62         0.68

4.4. Evaluation of Facial Cues.

Evaluation of Normalized Landmarks. The accuracy of the generated facial landmarks critically influences the alignment of lip movements with the driving audio and the congruence between expressions and the corresponding emotion labels. Normalizing the facial landmarks removes the influence of head pose on their direction of movement, allowing our method to focus on predicting the vertical motions of the landmarks in the mouth and eye regions, which is crucial for prediction accuracy. We provide a qualitative comparison between the proposed Landmark Sequentializer and state-of-the-art methods for emotional talking face generation in Fig. 4. We found that only our method can effectively predict eye blinks and generate lips with better alignment (see the red boxes), which is consistent with the landmark errors presented in Tab. 3.

Table 3. Quantitative results of different talking face generation models on MEAD dataset.
Method                              MLD ↓   FLD ↓   SyncNet ↑   SSIM ↑   PSNR ↑   FID ↓
ATVG (Chen et al., 2019)            3.14    3.87    2.24        0.57     28.58    67.6
SDA (Vougioukas et al., 2019)       3.99    4.5     1.88        0.44     28.54    -
Wav2Lip (Prajwal et al., 2020)      3.43    3.80    2.24        0.57     29.03    -
MakeItTalk (Zhou et al., 2020)      3.80    3.92    2.20        0.56     28.92    -
PC-AVS (Zhou et al., 2021)          2.97    2.74    2.10        0.60     29.02    -
Song (Song et al., 2022)            2.54    3.49    -           0.64     29.11    36.33
MEAD (Wang et al., 2020)            2.52    3.16    -           0.68     28.61    22.52
EVP (Ji et al., 2021)               2.45    3.01    -           0.71     29.53    7.99
EAMM (Ji et al., 2022)              2.41    2.55    2.26        0.66     29.29    -
Xu (Xu et al., 2023)                2.31    -       3.57        0.75     30.10    15.89
EMMN (Tan et al., 2023)             2.78    2.87    3.57        0.66     29.38    -
Ours w/o C                          2.21    2.11    4.53        0.69     30.14    9.12
Ours                                2.08    1.99    4.72        0.74     30.98    8.62

Evaluation of Head Pose. Generating and evaluating head pose sequences remains challenging in audio-driven talking face generation. Because the relationship between the driving audio and the resulting head pose sequence is complex and not one-to-one, quantitatively assessing head pose generation is inherently difficult. Therefore, we randomly selected generated videos across various emotion categories and plotted the pitch, yaw, and roll of the generated head pose sequences and the corresponding ground truth (GT) as line charts, as shown in Fig. 6. Across different emotion categories, the generated head pose sequences exhibit trends similar to the GT, demonstrating that the head poses generated by our method translate into emotionally aligned head movements. Additionally, the quantitative results presented in Tab. 2 corroborate the effectiveness of the Pose Sequentializer, consistent with the trends shown in Fig. 6.

Evaluation of Gaze. The direction of gaze is a critical facial cue in emotional talking face generation, significantly influencing the vividness of the generated video. Based on the gaze discretization strategy described in Sec. 4.2, we verify the effectiveness of the proposed Gaze Sequentializer by analyzing the spatial distribution of the pupils in the eye region. As shown in Fig. 5, pie charts represent the gaze distribution of different methods across various emotion categories. We found that MEAD (Wang et al., 2020) and EAMM (Ji et al., 2022) exhibit identical gaze distributions across different emotion categories, with the majority of the pupils falling within eye region "5". This implies that these two methods are incapable of generating meaningful gaze sequences. EVP (Ji et al., 2021) generates a gaze distribution similar to the ground truth (GT) only in the "happy" category, while it produces random gaze distributions in the other emotion categories. Our method generates gaze distributions similar to the GT across all emotion categories. In addition, we further validate the effectiveness of the Gaze Sequentializer by calculating the DTW of pupil movement speed sequences, as shown in Tab. 2. These results demonstrate that the gaze sequences generated by our method align with the corresponding emotion labels. To the best of our knowledge, ours is the first speech-driven talking face generation method capable of generating vivid gaze sequences related to emotion categories.

4.5. Comparison with State-of-the-art Methods.

Quantitative Evaluation. To conduct a comprehensive evaluation, we performed quantitative comparisons between our method and other approaches in both emotional and non-emotional talking face generation. Tab. 3 reports the quantitative results. Except on FID and SSIM, where EVP (Ji et al., 2021) and Xu et al. (2023) respectively score slightly better, our method significantly outperforms the other approaches in terms of MLD, FLD, and SyncNet confidence, owing to the joint contributions of all three proposed modules. The notable performance of EVP on the FID metric primarily results from its substantial investment in independently training a face generation model for each individual. The improvements in MLD, FLD, and SyncNet confidence show that our approach generates more accurate facial landmarks and achieves better lip alignment, consistent with the qualitative results shown in Fig. 4. The results for SSIM, PSNR, and FID further confirm that the proposed method can generate high-quality facial images. We also conducted an ablation study to validate the efficacy of the collaborative emotion classifier. The results, presented in Tabs. 3 and 2, demonstrate that removing the emotion classifier, denoted as 'Ours w/o C', leads to a significant decline in all quantitative metrics. We attribute this decline to the loss of the intrinsic connection that the emotion classifier establishes among landmarks, gaze, and head pose, which otherwise enhances their consistency.

Qualitative Evaluation. Fig. 7 presents the qualitative comparison results of our method with other state-of-the-art emotional talking face generation methods. Due to the high alignment between facial cues and emotion categories, our method significantly outperforms other approaches in terms of normalized landmarks, gaze, and head poses. Specifically, our method can generate lip shapes that are more consistent with the input speech, attributable to the high precision of normalized landmark generation. Additionally, the natural variations in head pose and gaze in the generated face sequences further demonstrate the effectiveness of the proposed method. Relevant comparison videos are provided in the supplementary materials.

Figure 8. User study results of talking face video generation quality and emotion accuracy.

User Study. We conducted a user study to compare our method with three SOTA models: MEAD, EVP, and EAMM. Specifically, we randomly selected five videos from each emotion category in the MEAD test set and evaluated both the overall quality and the accuracy of emotion expression in the generated videos. Human subjects were asked to vote for the video with the highest overall quality and the most accurate emotional expression. The rank-1 ratio for each method is presented in Fig. 8. Across the 20 participants, our method achieves rank-1 ratios of 52% for emotion accuracy and 57.88% for overall quality, significantly outperforming the other methods.
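
The rank-1 ratio used here is simply the fraction of votes in which a method was picked as the best. A small sketch of that computation follows; the function name `rank1_ratios` and the toy votes in the usage note are purely illustrative and are not the collected study data.

```python
from collections import Counter
from typing import Dict, Sequence


def rank1_ratios(votes: Sequence[str], methods: Sequence[str]) -> Dict[str, float]:
    """Fraction of votes in which each method was ranked first."""
    if not votes:
        raise ValueError("no votes collected")
    counts = Counter(votes)
    # Counter returns 0 for methods that never received a vote.
    return {m: counts[m] / len(votes) for m in methods}
```

For example, with four made-up votes `["Ours", "Ours", "EVP", "MEAD"]` over the methods `["MEAD", "EVP", "EAMM", "Ours"]`, the function assigns 0.5 to "Ours", 0.25 each to "EVP" and "MEAD", and 0.0 to "EAMM".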

5. Conclusion

We proposed a two-step framework to generate vivid talking faces with emotionally aligned facial cues. In the first step, we aligned facial cues, including normalized facial landmarks, gaze, and head pose, with corresponding emotion labels in a self-supervised learning manner. In the second step, we adopted latent key points as intermediate variables and utilized a pre-trained generative model to map various facial cues into high-quality facial images. Extensive experiments on the MEAD dataset demonstrate that our model advances the state-of-the-art performance significantly.

References

  • Burkov et al. (2020) Egor Burkov, Igor Pasechnik, Artur Grigorev, and Victor Lempitsky. 2020. Neural head reenactment with latent pose descriptors. In CVPR. 13786–13795.
  • Chen et al. (2020) Lele Chen, Guofeng Cui, Celong Liu, Zhong Li, Ziyi Kou, Yi Xu, and Chenliang Xu. 2020. Talking-head generation with rhythmic head motion. In ECCV. Springer, 35–51.
  • Chen et al. (2018) Lele Chen, Zhiheng Li, Ross K Maddox, Zhiyao Duan, and Chenliang Xu. 2018. Lip movements generation at a glance. In Proceedings of the European conference on computer vision (ECCV). 520–535.
  • Chen et al. (2019) Lele Chen, Ross K Maddox, Zhiyao Duan, and Chenliang Xu. 2019. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In CVPR. 7832–7841.
  • Chung and Zisserman (2017) Joon Son Chung and Andrew Zisserman. 2017. Out of time: automated lip sync in the wild. In Workshop on Multi-view Lip-reading, ACCV. Springer, 251–263.
  • Drobyshev et al. (2022) Nikita Drobyshev, Jenya Chelishev, Taras Khakhulin, Aleksei Ivakhnenko, Victor Lempitsky, and Egor Zakharov. 2022. Megaportraits: One-shot megapixel neural head avatars. In ACM MM. 2663–2671.
  • Du et al. (2023) Chenpeng Du, Qi Chen, Tianyu He, Xu Tan, Xie Chen, Kai Yu, Sheng Zhao, and Jiang Bian. 2023. Dae-talker: High fidelity speech-driven talking face generation with diffusion autoencoder. In Proceedings of the 31st ACM International Conference on Multimedia. 4281–4289.
  • Fan et al. (2022) Yingruo Fan, Zhaojiang Lin, Jun Saito, Wen** Wang, and Taku Komura. 2022. Faceformer: Speech-driven 3d facial animation with transformers. In CVPR. 18770–18780.
  • Gu et al. (2020) Kuangxiao Gu, Yuqian Zhou, and Thomas Huang. 2020. Flnet: Landmark driven fetching and learning network for faithful talking facial animation synthesis. In AAAI, Vol. 34. 10861–10868.
  • Guo et al. (2020) Jianzhu Guo, Xiangyu Zhu, Yang Yang, Fan Yang, Zhen Lei, and Stan Z Li. 2020. Towards fast, accurate and stable 3d dense face alignment. In ECCV. Springer, 152–168.
  • Gururani et al. (2023) Siddharth Gururani, Arun Mallya, Ting-Chun Wang, Rafael Valle, and Ming-Yu Liu. 2023. Space: Speech-driven portrait animation with controllable expression. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 20914–20923.
  • Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium. NIPS 30 (2017).
  • Hong et al. (2022) Fa-Ting Hong, Longhao Zhang, Li Shen, and Dan Xu. 2022. Depth-aware generative adversarial network for talking head video generation. In CVPR. 3397–3406.
  • Ji et al. (2022) Xinya Ji, Hang Zhou, Kaisiyuan Wang, Qianyi Wu, Wayne Wu, Feng Xu, and Xun Cao. 2022. Eamm: One-shot emotional talking face via audio-based emotion-aware motion model. In ACM SIGGRAPH. 1–10.
  • Ji et al. (2021) Xinya Ji, Hang Zhou, Kaisiyuan Wang, Wayne Wu, Chen Change Loy, Xun Cao, and Feng Xu. 2021. Audio-driven emotional video portraits. In CVPR. 14080–14089.
  • Karras et al. (2017) Tero Karras, Timo Aila, Samuli Laine, Antti Herva, and Jaakko Lehtinen. 2017. Audio-driven facial animation by joint end-to-end learning of pose and emotion. TOG 36, 4 (2017), 1–12.
  • Liu et al. (2021) ** Liu, Peng Chen, Tao Liang, Zhaoxing Li, Cai Yu, Shuqiao Zou, Jiao Dai, and Jizhong Han. 2021. Li-net: Large-pose identity-preserving face reenactment network. In ICME. IEEE, 1–6.
  • Liu et al. (2022) Xian Liu, Yinghao Xu, Qianyi Wu, Hang Zhou, Wayne Wu, and Bolei Zhou. 2022. Semantic-aware implicit neural audio-driven video portrait generation. In European Conference on Computer Vision. Springer, 106–125.
  • Lugaresi et al. (2019) Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chuo-Ling Chang, Ming Guang Yong, Juhyun Lee, et al. 2019. Mediapipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172 (2019).
  • Prajwal et al. (2020) KR Prajwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, and CV Jawahar. 2020. A lip sync expert is all you need for speech to lip generation in the wild. In ACM MM. 484–492.
  • Shen et al. (2023) Shuai Shen, Wenliang Zhao, Zibin Meng, Wanhua Li, Zheng Zhu, Jie Zhou, and Jiwen Lu. 2023. DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1982–1991.
  • Siarohin et al. (2021) Aliaksandr Siarohin, Oliver J Woodford, Jian Ren, Menglei Chai, and Sergey Tulyakov. 2021. Motion representations for articulated animation. In CVPR. 13653–13662.
  • Song et al. (2022) Linsen Song, Wayne Wu, Chen Qian, Ran He, and Chen Change Loy. 2022. Everybody’s talkin’: Let me talk as you want. IEEE Transactions on Information Forensics and Security 17 (2022), 585–598.
  • Tan et al. (2023) Shuai Tan, Bin Ji, and Ye Pan. 2023. Emmn: Emotional motion memory network for audio-driven emotional talking face generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 22146–22156.
  • Taylor et al. (2017) Sarah Taylor, Taehwan Kim, Yisong Yue, Moshe Mahler, James Krahe, Anastasio Garcia Rodriguez, Jessica Hodgins, and Iain Matthews. 2017. A deep learning approach for generalized speech animation. TOG 36, 4 (2017), 1–11.
  • Vougioukas et al. (2019) Konstantinos Vougioukas, Stavros Petridis, and Maja Pantic. 2019. End-to-end speech-driven facial animation with temporal GANs. In CVPR Workshop.
  • Wang et al. (2023a) Duomin Wang, Yu Deng, Zixin Yin, Heung-Yeung Shum, and Baoyuan Wang. 2023a. Progressive Disentangled Representation Learning for Fine-Grained Controllable Talking Head Synthesis. In CVPR. 17979–17989.
  • Wang et al. (2023b) Jiayu Wang, Kang Zhao, Shiwei Zhang, Yingya Zhang, Yujun Shen, Deli Zhao, and **gren Zhou. 2023b. LipFormer: High-Fidelity and Generalizable Talking Face Generation With a Pre-Learned Facial Codebook. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13844–13853.
  • Wang et al. (2020) Kaisiyuan Wang, Qianyi Wu, Linsen Song, Zhuoqian Yang, Wayne Wu, Chen Qian, Ran He, Yu Qiao, and Chen Change Loy. 2020. Mead: A large-scale audio-visual dataset for emotional talking-face generation. In ECCV. Springer, 700–717.
  • Wang et al. (2021a) Suzhen Wang, Lincheng Li, Yu Ding, Changjie Fan, and Xin Yu. 2021a. Audio2head: Audio-driven one-shot talking-head generation with natural head motion. IJCAI (2021).
  • Wang et al. (2021b) Ting-Chun Wang, Arun Mallya, and Ming-Yu Liu. 2021b. One-shot free-view neural talking-head synthesis for video conferencing. In CVPR. 10039–10049.
  • Wang et al. (2022) Yaohui Wang, Di Yang, Francois Bremond, and Antitza Dantcheva. 2022. Latent image animator: Learning to animate images via latent space navigation. ICLR.
  • Wu et al. (2023) Haozhe Wu, Songtao Zhou, Jia Jia, Junliang Xing, Qi Wen, and Xiang Wen. 2023. Speech-Driven 3D Face Animation with Composite and Regional Facial Movements. In Proceedings of the 31st ACM International Conference on Multimedia. 6822–6830.
  • Xiang et al. (2020) Sitao Xiang, Yuming Gu, Pengda Xiang, Mingming He, Koki Nagano, Haiwei Chen, and Hao Li. 2020. One-shot identity-preserving portrait reenactment. arXiv preprint arXiv:2004.12452 (2020).
  • Xu et al. (2023) Chao Xu, Junwei Zhu, Jiangning Zhang, Yue Han, Wenqing Chu, Ying Tai, Chengjie Wang, Zhifeng Xie, and Yong Liu. 2023. High-fidelity Generalized Emotional Talking Face Generation with Multi-modal Emotion Space Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6609–6619.
  • Zhou et al. (2019) Hang Zhou, Yu Liu, Ziwei Liu, ** Luo, and Xiaogang Wang. 2019. Talking face generation by adversarially disentangled audio-visual representation. In AAAI, Vol. 33. 9299–9306.
  • Zhou et al. (2021) Hang Zhou, Yasheng Sun, Wayne Wu, Chen Change Loy, Xiaogang Wang, and Ziwei Liu. 2021. Pose-controllable talking face generation by implicitly modularized audio-visual representation. In CVPR. 4176–4186.
  • Zhou et al. (2020) Yang Zhou, Xintong Han, Eli Shechtman, Jose Echevarria, Evangelos Kalogerakis, and Dingzeyu Li. 2020. MakeItTalk: speaker-aware talking-head animation. TOG 39, 6 (2020), 1–15.