Emotional Conversation: Empowering Talking Faces with Cohesive Expression, Gaze and Pose Generation
Abstract.
Vivid talking face generation holds immense potential applications across diverse multimedia domains, such as film and game production. While existing methods accurately synchronize lip movements with input audio, they typically ignore crucial alignments between emotion and facial cues, which include expression, gaze, and head pose. These alignments are indispensable for synthesizing realistic videos. To address these issues, we propose a two-stage audio-driven talking face generation framework that employs 3D facial landmarks as intermediate variables. This framework achieves collaborative alignment of expression, gaze, and pose with emotions through self-supervised learning. Specifically, we decompose this task into two key steps, namely speech-to-landmarks synthesis and landmarks-to-face generation. The first step focuses on simultaneously synthesizing emotionally aligned facial cues, including normalized landmarks that represent expressions, gaze, and head pose. These cues are subsequently reassembled into relocated facial landmarks. In the second step, these relocated landmarks are mapped to latent key points using self-supervised learning and then input into a pretrained model to create high-quality face images. Extensive experiments on the MEAD dataset demonstrate that our model significantly advances the state-of-the-art performance in both visual quality and emotional alignment.
1. Introduction
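Table 1. Facial cues that each method can generate or control. ⚫: generated automatically or controllable via source videos; ◗: controllable only via source videos; ❍: neither generated nor controllable.
| Facial Cues | Ours | Wav2Lip | MakeItTalk | PC-AVS | Audio2Head | MEAD | EVP | EAMM | EMMN | Xu | PD-FGC | SPACE |
| Landmarks | ⚫ | ❍ | ⚫ | ❍ | ❍ | ❍ | ❍ | ❍ | ❍ | ❍ | ❍ | ⚫ |
| Head Pose | ⚫ | ◗ | ⚫ | ◗ | ◗ | ❍ | ◗ | ❍ | ❍ | ❍ | ◗ | ⚫ |
| Gaze | ⚫ | ❍ | ❍ | ❍ | ❍ | ❍ | ❍ | ❍ | ❍ | ❍ | ◗ | ◗ |
| Emotion | ⚫ | ❍ | ❍ | ❍ | ❍ | ⚫ | ⚫ | ⚫ | ⚫ | ⚫ | ◗ | ⚫ |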
The task of talking face generation involves creating a video of a talking face using a still identity image of the speaker and an audio track of their speech content. Furthermore, the generation of emotional talking face videos, featuring precise lip synchronization and vivid facial cues such as expressions, gaze, and head pose, holds considerable potential for future applications. We found that these facial cue sequences exhibit consistent patterns across videos corresponding to specific emotions. For instance, in a talking face depicting contempt, individuals typically narrow their eyes, tilt their heads upward, and shift their gaze horizontally. Conversely, in videos portraying surprise, individuals generally widen their eyes while maintaining a forward-facing head pose and gaze. The alignment of these facial cues with emotions is crucial for synthesizing realistic talking face videos. Therefore, the primary challenge of this task is to produce realistic facial videos that not only accurately reproduce lip movements in response to the driving audio but also maintain consistency between the various facial cues and the corresponding emotions.
With the development of AI-Generated Content (AIGC), numerous advanced methods for generating emotional talking faces have emerged. These methods can be mainly divided into audio-driven talking face generation (Wang et al., 2020; Ji et al., 2022; Gururani et al., 2023) and video-driven talking face generation (Ji et al., 2021; Wang et al., 2023a). Audio-driven talking face generation synthesizes a sequence of portrait images by using both an identity image and the corresponding audio as inputs. However, most existing works focus exclusively on lip movement and often lack the ability to generate associated facial cues. The latest work, SPACE (Gururani et al., 2023), is a multi-stage generative framework that achieves fine-grained control over facial expressions, emotion categories, head poses, and gaze directions of generated faces by manipulating the intermediate landmarks. Unfortunately, in SPACE, the control and generation of gaze and head pose sequences are unrelated to emotion. Although the generated faces display high visual quality, the final videos lack vividness and fail to differentiate emotions effectively. Video-driven talking face generation involves using a single identity image and multiple driving source videos as inputs. By employing a contrastive learning strategy, relevant features are extracted from various source videos to generate high-quality target talking face videos. These methods allow for the editing of facial attributes, such as head pose, facial expression, gaze, blinking, and audio, by modifying the corresponding source videos. However, these methods are not only expensive in the training stage but also incur additional costs in the inference stage due to the need for source video collection. We find that the fine-grained facial cues of the talking face video, such as expression, gaze, and head pose, are closely related to emotional categories. Independently controlling these facial cues without considering emotion can lead to unrealistic results. Therefore, we need a method that combines the low cost of audio-driven generation with a close alignment between facial cues and emotion to synthesize vivid emotional talking face videos.
In this paper, we propose an advanced audio-driven method to synthesize vivid emotional talking faces with emotionally aligned facial cues. We decompose the task into two sub-tasks: speech-to-landmarks synthesis and landmarks-to-face generation. Given input speech, an emotion label, and an identity image, the proposed speech-to-landmarks module generates sequences of normalized facial landmarks (representing expressions), gaze, and head pose in an auto-regressive manner. To address the issue of substantial variations in gaze, we discretize the eye region and model gaze prediction as a classification task, successfully predicting gaze sequences for the first time. The alignment of emotional labels with these facial cues is achieved through self-supervised learning. These generated facial cues are synthesized into relocated 3D facial landmarks through coordinate correction and rotation. The relocated facial landmarks, driven by these cues, not only synchronize with the input audio but also enhance the alignment of expression, gaze, and pose movements with the corresponding emotion labels. We also build a collaborative emotion classifier to model the intrinsic relationships among these facial cues. Specifically, the classifier takes aggregated intermediate features from facial cues as input, ensuring consistency among these facial cue sequences. The proposed landmarks-to-face module utilizes a latent keypoint space (Siarohin et al., 2021), capable of producing more realistic faces compared to traditional facial landmarks. Specifically, this module maps the relocated landmarks, generated by the speech-to-landmarks module, to latent keypoints. It then employs a pre-trained generator (Wang et al., 2021b) to synthesize high-quality facial images.
To the best of our knowledge, this is the first attempt to align normalized facial landmarks, gaze, and head pose concurrently with emotional categories. Compared to existing works, these explicitly aligned facial cues significantly improve the intensity and accuracy of emotional expressions in generated talking faces. Additionally, other contributions are as follows:
• We introduce an auto-regressive speech-to-landmarks synthesis algorithm that generates emotionally aligned facial cues, including normalized facial landmarks, gaze, and head pose.
• We design a specific eye region discretization strategy that efficiently generates gaze sequences by modeling gaze prediction as a classification task.
• Extensive experiments on the MEAD (Wang et al., 2020) dataset demonstrate the superiority of the proposed method in terms of lip synchronization and the quality of generated faces.
2. Related Work
In the following section, we provide an overview of prior research on audio-driven talking face generation and video-driven talking face generation. Table 1 outlines a summary of the key differences between our approach and other state-of-the-art methods. Specifically, a solid circle indicates that the method can automatically generate the relevant attribute or be driven by other source videos to control the relevant attribute. A half-filled circle denotes that the related attribute can only be controlled by other source videos. Conversely, an empty circle represents that the method cannot generate or control the related attribute.
Audio-driven Talking Face Generation. The objective of audio-driven talking face generation (Karras et al., 2017; Taylor et al., 2017; Wang et al., 2021a; Zhou et al., 2019; Vougioukas et al., 2019; Liu et al., 2022; Wang et al., 2023b; Shen et al., 2023; Fan et al., 2022; Du et al., 2023; Wu et al., 2023) is to establish a mapping from the input speech to facial representations. MakeItTalk (Zhou et al., 2020) disentangles audio content and speaker information to control lip motion and facial expressions, and effectively works with various portrait styles. Audio2Head (Wang et al., 2021a) addresses challenges in achieving natural head motion by employing a motion-aware RNN for head pose prediction. With the further development of this field, there is a growing emphasis on controlling facial emotions in generated talking faces. Wang et al. collected the MEAD dataset (Wang et al., 2020) and proposed a method that conditions talking head generation on emotion labels. However, MEAD primarily focuses on controlling only the mouth region while leaving other parts unchanged, leading to a lack of continuity in the generated videos. Compared to MEAD, which uses a single emotion label as input to control the emotion category of the generated video, EAMM (Ji et al., 2022) achieves precise emotional control over the synthesized video by adopting features extracted from an emotion source video. However, the method still ignores the movement of the gaze direction and head pose, which results in less realistic generated videos. EMMN (Tan et al., 2023) and Xu et al. (2023) employ memory networks and textual prompts, respectively, to control the emotions in the generated videos. SPACE (Gururani et al., 2023) achieves fine-grained control over facial expressions, emotion categories, head poses, and gaze directions of generated faces by decomposing the generation task into multiple subtasks. While these methods are user-friendly and straightforward, they often struggle to generate emotionally aligned facial cues for generated videos.
Video-driven Talking Face Generation. The goal of video-driven talking face generation (Drobyshev et al., 2022; Gu et al., 2020; Hong et al., 2022; Wang et al., 2022; Burkov et al., 2020; Liu et al., 2021; Xiang et al., 2020) is to accurately map facial movements from a source video onto a target image. Wav2Lip (Prajwal et al., 2020) constructs an expert discriminator to ensure precise alignment between the lip movements and the input audio. PC-AVS (Zhou et al., 2021) utilizes an implicit low-dimensional pose code to separate audio-visual representations and achieve accurate lip-syncing and pose control. EVP (Ji et al., 2021) decomposes speech into two decoupled spaces to generate dynamic 2D emotional facial landmarks. PD-FGC (Wang et al., 2023a) allows for the editing of facial attributes, such as head pose, facial expression, gaze, and blinking, by modifying the corresponding source videos. These methods are not only costly during the training stage, but also entail additional expenses during the inference stage due to the requirement of source video collection.
3. Two-stage Talking Face Generation
The proposed method takes the driving audio and a reference image, along with an emotion label, and produces an emotional talking face. It decomposes this task into two steps: (1) Speech-to-Landmarks Synthesis: Given a reference image, it extracts normalized landmarks, gaze labels, and head poses, and predicts their per-frame motions driven by the input speech and emotion label. Specifically, our proposed method innovatively generates cohesive sequences of normalized facial landmarks, gaze, and head pose simultaneously. Furthermore, we accomplish the collaborative alignment of these facial cues with the corresponding emotional labels using self-supervised learning. (2) Landmarks-to-Face Generation: In this step, the per-frame relocated facial landmarks are mapped to latent keypoints, which are then fed into the pre-trained model (Wang et al., 2021b) to generate the final emotional talking face. This decomposition offers multiple advantages. Firstly, it enables fine-grained control over the output facial expressions. Secondly, the two-stage training approach can reduce the training complexity and accelerate the convergence of each module. By leveraging a pretrained face generator, we can effectively reduce the training cost while obtaining high-quality emotional talking faces. The overall framework is shown in Fig. 2.
3.1. Speech-to-Landmarks Synthesis.
Given an input speech, emotion label, and an identity face image, the proposed speech-to-landmarks synthesis is capable of generating sequences of normalized facial landmarks (representing expressions), gaze, and head pose in an auto-regressive manner. These facial cues are further aligned with specific emotions through self-supervised learning. Specifically, the network for speech-to-landmarks synthesis consists of three modules: Landmark Sequentializer, Gaze Sequentializer, and Pose Sequentializer, each performing auto-regressive prediction on different facial cues sequences.
Landmark Sequentializer. In talking face generation, the quality of facial landmark generation is paramount, as these landmarks directly control lip movements and expressions. Given the MFCC features of the input audio and the normalized 3D facial landmarks of the input image, the Landmark Sequentializer predicts the normalized landmarks of each frame auto-regressively. This is represented by
$$\hat{l}_{t+1} = G_{l}\left(f^{l}_{t},\, f^{a}_{t},\, e\right) \tag{1}$$
where $f^{l}_{t}$ and $f^{a}_{t}$ are the facial landmark features and audio features at time $t$, respectively. $G_{l}$ represents the auto-regressive generator, and $\hat{l}_{t+1}$ is the normalized 3D facial landmarks at time $t+1$. The input emotion label embedding $e$ corresponds to the indexed vectorial representation in the embedding matrix $E$ for an emotion dictionary containing $N_{e}$ emotion categories, which is learned together with the whole model. We employ a convolutional neural network (CNN) to encode the audio data, while a multi-layer perceptron (MLP) is used to encode the 3D facial landmarks. The network is trained with an L1 loss function, which reduces the L1 distance between the predicted facial landmarks and the normalized ground truth landmarks. Notably, we assign a higher loss scale to the y-axis to emphasize vertical motion errors during training (Gururani et al., 2023). Using facial landmarks as an intermediary representation is advantageous as it facilitates the explicit manipulation of facial features. For instance, by manipulating the landmarks of the eyes, it becomes possible to incorporate eye blinks into the face. We discovered that conducting predictions in the normalized space is crucial to simplify the mapping between phonemes and lip motions.
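The axis-weighted L1 objective described above can be illustrated with a minimal sketch. The function name `weighted_l1_loss` and the weight values `(1.0, 2.0, 1.0)` are assumptions chosen for illustration; the paper states only that the y-axis receives a higher loss scale, not its value.

```python
def weighted_l1_loss(pred, target, axis_weights=(1.0, 2.0, 1.0)):
    """Mean absolute error over 3D landmarks with a per-axis weight.

    pred, target: lists of (x, y, z) tuples, one per landmark.
    axis_weights: hypothetical scales; the y weight is larger so that
    vertical motion errors contribute more to the loss.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must contain the same number of landmarks")
    total = 0.0
    for p, t in zip(pred, target):
        for w, a, b in zip(axis_weights, p, t):
            total += w * abs(a - b)
    # Average over all landmark coordinates.
    return total / (len(pred) * 3)


pred = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
target = [(0.0, 0.5, 0.0), (1.0, 1.0, 2.0)]
loss = weighted_l1_loss(pred, target)
```

In this toy case a vertical error of 0.5 costs as much as a depth error of 1.0, which is the intended effect of the larger y-axis scale.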
Pose Sequentializer. Natural head movement effectively enhances the vividness of the generated video, and its motion pattern is also influenced by the emotional category. Hence, we also train an auto-regressive generator that predicts the rotation and translation for the facial landmarks. The rotation is represented by three angles (yaw, pitch, and roll), and the corresponding translation is the displacement of the 3D landmarks along the x-axis, y-axis, and z-axis. The prediction is represented by
$$\hat{p}_{t+1} = G_{p}\left(f^{p}_{t},\, f^{a}_{t},\, e\right) \tag{2}$$
where $\hat{p}_{t+1} \in \mathbb{R}^{6}$ concatenates the three rotation angles and the three translation components, $f^{p}_{t}$ and $f^{a}_{t}$ are the pose features and audio features at time $t$, $e$ is the emotion label embedding, and $G_{p}$ represents the pose sequentializer. The poses, whether predicted or extracted from a reference video, are applied to the frontal normalized landmarks predicted by our Landmark Sequentializer. This transformation maps the normalized landmarks back to the image space after applying an appropriate scaling factor. The Pose Sequentializer is also trained with an L1 loss.
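As a rough illustration of how a six-dimensional pose (yaw, pitch, roll, and a 3D translation) can be applied to frontal normalized landmarks, the sketch below rotates, scales, and translates each point. The rotation order `Rz(roll) · Ry(yaw) · Rx(pitch)`, the helper names, and the `scale` parameter are assumptions; the paper does not specify its angle convention.

```python
import math


def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_matrix(yaw, pitch, roll):
    """Assumed convention: R = Rz(roll) @ Ry(yaw) @ Rx(pitch), angles in radians."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    rx = [[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]]
    ry = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]
    rz = [[cr, -sr, 0.0], [sr, cr, 0.0], [0.0, 0.0, 1.0]]
    return matmul3(rz, matmul3(ry, rx))


def apply_pose(landmarks, yaw, pitch, roll, translation, scale=1.0):
    """Rotate, scale, then translate each normalized (x, y, z) landmark."""
    r = rotation_matrix(yaw, pitch, roll)
    posed = []
    for p in landmarks:
        rotated = [sum(r[i][k] * p[k] for k in range(3)) for i in range(3)]
        posed.append(tuple(scale * rotated[i] + translation[i] for i in range(3)))
    return posed


# A 90-degree roll maps the x unit vector onto the y axis.
posed = apply_pose([(1.0, 0.0, 0.0)], 0.0, 0.0, math.pi / 2,
                   translation=(1.0, 2.0, 3.0), scale=2.0)
```

With zero angles and zero translation the function reduces to a pure rescaling, which matches the description of mapping normalized landmarks back to image space.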
Gaze Sequentializer. The eyes, as important organs for human interaction, convey abundant information through gaze direction, which can impact the emotion category of generated videos. In contrast to the prediction of facial landmarks and head pose, we transform the prediction of gaze direction from regression to classification by discretizing the eye regions, as shown in Fig. 3. This effectively enhances the accuracy of gaze direction prediction. However, when predicting a sequence of gaze directions, the current gaze direction heavily relies on the previous gaze. We adopt an auto-regressive generator $G_{g}$ to model this dependency. Specifically, the gaze prediction of the two eyes at time $n$ can be modeled as:
$$h_{n} = G_{g}\left(\hat{g}_{n-1},\, f^{a}_{n},\, e\right) \tag{3}$$
where $\hat{g}_{n-1}$ is the gaze label predicted at the previous step, $f^{a}_{n}$ is the audio feature at time $n$, and $e$ is the emotion label embedding. Formally, the gaze decoder performs classification at the $n$-th time step based on the hidden state $h_{n}$ by
$$p_{n} = \mathrm{softmax}\left(W h_{n}\right), \qquad k^{*} = \arg\max_{k \in \{1, \dots, K\}} p_{n}[k] \tag{4}$$
where $W$ denotes a linear transformation. $p_{n}$ represents the calculated probabilities for a total of $K$ classification entries. $k^{*}$ is the entry with the maximal probability, from which we can infer the corresponding gaze label $\hat{g}_{n}$. Finally, we utilize the cross-entropy loss to optimize the gaze direction prediction model:
$$\mathcal{L}_{gaze} = -\sum_{n} \log p_{n}[g_{n}] \tag{5}$$
where $g_{n}$ is the ground truth gaze label at time $n$.
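The classification view of gaze prediction can be sketched as a softmax over region scores followed by an argmax and a cross-entropy loss. This is a minimal illustration only: the logits below are made-up numbers standing in for the output of the linear layer, and `gaze_step` and `sequence_loss` are hypothetical helper names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gaze_step(logits, target_label):
    """Return (predicted region index, cross-entropy loss) for one time step."""
    probs = softmax(logits)
    predicted = max(range(len(probs)), key=probs.__getitem__)
    loss = -math.log(probs[target_label])
    return predicted, loss


def sequence_loss(logit_seq, label_seq):
    """Average the per-step cross-entropy over a gaze sequence."""
    losses = [gaze_step(l, y)[1] for l, y in zip(logit_seq, label_seq)]
    return sum(losses) / len(losses)


# Ten region scores for one eye (illustrative values), peaked at region 3.
logits = [0.0] * 10
logits[3] = 5.0
predicted, loss = gaze_step(logits, target_label=3)
```

When the scores are uniform over ten regions the loss equals log(10), the usual chance-level reference for a ten-way classifier.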
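After obtaining these three types of facial cues, we integrate the gaze and head pose data into the normalized landmarks to obtain the relocated landmarks, as shown in Fig. 2. We also investigate the inherent relationships among these facial cues under a specific emotion. To this end, we construct a collaborative emotion classifier $C$ to enforce consistency among these facial cue sequences. The classifier predicts the emotion category using the aggregated intermediate features from the facial cues as input. Specifically, we obtain the fake intermediate features $\hat{F}$ by aggregating the intermediate features of the landmark, gaze, and pose cues predicted during the training stage. The classifier is optimized by
$$\mathcal{L}_{emo} = \mathrm{CE}\left(p^{e},\, y^{e}\right), \qquad p^{e} = C(\hat{F}) \tag{6}$$
where $\mathrm{CE}$ is the cross-entropy loss, $y^{e}$ is the input emotion label, and $p^{e}$ is the calculated probabilities for a total of $N_{e}$ emotion classification entries. The total loss for the stage of speech-to-landmarks synthesis is
$$\mathcal{L}_{total} = \mathcal{L}_{lmk} + \mathcal{L}_{pose} + \mathcal{L}_{gaze} + \mathcal{L}_{emo} \tag{7}$$
where $\mathcal{L}_{lmk}$ and $\mathcal{L}_{pose}$ are the L1 losses of the Landmark Sequentializer and the Pose Sequentializer, and $\mathcal{L}_{gaze}$ is the cross-entropy loss of the Gaze Sequentializer.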
In fact, the generation of all facial cues incorporates emotional labels as input. By minimizing the discrepancy between the generated facial cues and the ground truth detected by off-the-shelf detectors (see Sec.4.2), we achieve alignment of expressions, gazes, and head poses with emotion labels through self-supervised learning.
3.2. Landmarks-to-Face Generation.
The field of face generation has developed rapidly, and we find that using latent keypoints as input can generate high-quality facial images. Following SPACE (Gururani et al., 2023), we utilize the pre-trained face-vid2vid model (Wang et al., 2021b), a state-of-the-art framework for generating faces from latent keypoints. This approach avoids the need to learn a facial image generator from scratch, thereby reducing computational requirements and improving efficiency. Specifically, we first use the relocated landmarks in 3D space generated in the speech-to-landmarks synthesis stage, along with the input speech and emotion labels, as input to perform auto-regressive prediction of latent keypoints in a self-supervised learning manner, as shown in Fig. 2.
$$\tilde{l}_{t} = R\left(\hat{l}_{t},\, \hat{g}_{t},\, \hat{p}_{t}\right), \qquad \hat{k}_{t} = G_{k}\left(\tilde{l}_{t},\, f^{a}_{t},\, e\right) \tag{8}$$
where $R$ converts the normalized 3D facial landmarks $\hat{l}_{t}$ to relocated facial landmarks $\tilde{l}_{t}$ using the gaze label $\hat{g}_{t}$ and head pose $\hat{p}_{t}$ obtained in the speech-to-landmarks stage, $f^{a}_{t}$ and $e$ are the audio features and the emotion label embedding, and $G_{k}$ is the auto-regressive generator for the 3D latent keypoints $\hat{k}_{t}$. Then, given the predicted latent keypoints and the initial face, the pre-trained model generates high-quality facial images by using a flow-based warping field as an intermediate variable. Finally, we can generate high-quality facial images with a resolution of 256, which is generally superior to previous works. By breaking down the generation of 3D facial landmarks into the collaborative production of three facial cues, we have simplified the task. Consequently, we selected a lightweight Bi-LSTM as the backbone for all four auto-regressive generators (landmark, pose, gaze, and latent keypoint), enabling the generation of high-quality facial cues and 3D keypoints with minimal computational expense.
This work fundamentally differs from SPACE (Gururani et al., 2023). SPACE focuses on precise control of the generated faces but neglects the alignment of related facial cues with emotions. Consequently, this oversight results in low differentiation among emotion categories and diminished vividness in the generated emotional talking faces. Conversely, our research prioritizes maintaining consistency between the facial cues and the driving emotion labels in generated emotional talking face videos. Furthermore, utilizing our proposed eye region discretization strategy, we have successfully generated gaze sequences for the first time.
4. Experiment
4.1. Implementation Details.
The videos were sampled at a rate of 30 frames per second (FPS), while the audio was pre-processed to 16 kHz. To extract audio features, we computed 28-dimensional MFCC using a window size of 30. We trained and evaluated our method using the MEAD dataset, an audio-visual emotional dataset comprising 60 actors/actresses and eight different emotion categories.
4.2. Dataset Preprocessing.
The proposed method adopts a self-supervised learning approach in both Speech-to-Landmarks Synthesis and Landmarks-to-Face Generation. To achieve this, we preprocess the emotional talking face videos to obtain training pairs by extracting the facial landmarks and latent keypoints for each frame. Variations in head poses within videos lead to a greater diversity of landmark movements, potentially diminishing prediction accuracy. Consequently, our work performs face normalization. Specifically, videos are aligned by centering on the nose tip and resized to a uniform resolution of 256 × 256 pixels, a data processing approach prevalent in related works (Chen et al., 2020, 2018; Zhou et al., 2019). Given a talking-head video, we first extract per-frame facial landmarks and head poses. We adopt the Mediapipe (Lugaresi et al., 2019) landmark detector to extract 478 3D facial landmarks from each frame. To reduce computational costs and enhance model inference speed, we select a total of 147 facial landmarks for training while retaining the landmarks in the eye area. The per-frame head pose is obtained by the 3DDFA (Guo et al., 2020) landmark detector. We then rotate the face such that the nose tip faces straight towards the camera, aligned with the camera axis. The per-frame frontalized 3D facial landmarks are then scaled to obtain normalized landmarks with the same facial width.
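The centering and width normalization steps can be sketched as follows. This is a simplified illustration that assumes the landmarks are already frontalized; the function name and the landmark indices passed as `nose_idx`, `left_idx`, and `right_idx` are hypothetical placeholders, not actual Mediapipe indices.

```python
import math


def normalize_landmarks(landmarks, nose_idx, left_idx, right_idx, target_width=1.0):
    """Center landmarks on the nose tip and rescale to a fixed facial width.

    landmarks: list of (x, y, z) tuples, assumed already frontalized.
    left_idx / right_idx: indices of two points spanning the face width.
    """
    nose = landmarks[nose_idx]
    centered = [tuple(c - n for c, n in zip(p, nose)) for p in landmarks]
    width = math.dist(centered[left_idx], centered[right_idx])
    if width == 0.0:
        raise ValueError("degenerate face width; cannot normalize")
    s = target_width / width
    return [tuple(s * c for c in p) for p in centered]


# Toy face: nose at (2, 2, 0), face spanning x = 0 .. 4.
face = [(2.0, 2.0, 0.0), (0.0, 2.0, 0.0), (4.0, 2.0, 0.0)]
normalized = normalize_landmarks(face, nose_idx=0, left_idx=1, right_idx=2)
```

After this step, every frame shares the same origin and facial width, so differences between frames reflect expression changes rather than position or scale.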
The majority of existing speech-driven talking face generation methods have neglected the generation of gaze direction sequences, resulting in individuals in the generated videos keeping their gaze fixed forward. This omission weakens the authenticity of the generated videos. In this paper, to achieve more effective prediction of gaze direction, we discretize gaze direction and model the prediction as a classification problem rather than a regression problem. Specifically, we divide each eye area into 10 regions based on facial landmarks. During the training process, we directly predict into which region the pupil will fall. The pipeline of gaze discretization is illustrated in Fig. 3. In addition to landmarks, we also extract latent keypoints per frame using the pretrained face-vid2vid encoder (Wang et al., 2021b). This provides per-frame pairs of (facial landmarks, latent keypoints).
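One plausible way to turn a pupil position into one of 10 region labels is to overlay a grid on the bounding box of the eye landmarks. The 2 × 5 grid layout below is purely an assumption made for illustration; the actual partition used in the paper is the one shown in its Fig. 3 and may differ. The function name `gaze_label` is likewise hypothetical.

```python
def gaze_label(pupil, eye_landmarks, rows=2, cols=5):
    """Map a 2D pupil position to a region index in [0, rows * cols).

    pupil: (x, y) position of the pupil center.
    eye_landmarks: list of (x, y) points outlining the eye.
    The rows x cols grid is an assumed layout, not the paper's exact one.
    """
    xs = [p[0] for p in eye_landmarks]
    ys = [p[1] for p in eye_landmarks]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    if x_max == x_min or y_max == y_min:
        raise ValueError("degenerate eye bounding box")
    # Relative position of the pupil inside the eye bounding box.
    u = (pupil[0] - x_min) / (x_max - x_min)
    v = (pupil[1] - y_min) / (y_max - y_min)
    # Clamp so that pupils on or outside the border still get a valid cell.
    col = min(cols - 1, max(0, int(u * cols)))
    row = min(rows - 1, max(0, int(v * rows)))
    return row * cols + col


eye = [(0.0, 0.0), (10.0, 0.0), (0.0, 4.0), (10.0, 4.0)]
center_top = gaze_label((5.0, 1.0), eye)
bottom_right = gaze_label((9.9, 3.9), eye)
```

The resulting integer labels can be used directly as targets for a cross-entropy classifier, which is what makes the discretized formulation convenient.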
4.3. Evaluation Metrics.
To evaluate the alignment between the generated face and the input audio, we calculate the Euclidean distance of facial landmarks between the generated images and the ground truth images in the mouth region (MLD (Chen et al., 2019)). We also evaluate the accuracy of facial expressions by measuring the landmark distance over the whole face (FLD). We use the confidence scores of SyncNet (Chung and Zisserman, 2017) to evaluate the consistency between the generated face and the driving audio at the feature level. For the visual quality of the synthesized face, we use Structural Similarity (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Fréchet Inception Distance (FID) (Heusel et al., 2017) for quantitative analysis of the generated results. Additionally, we need to quantitatively assess the effectiveness of gaze and head pose in talking face videos. Given the diversity of these facial cues, calculating their frame-by-frame differences from the ground truth is not meaningful. Therefore, we adopt Dynamic Time Warping (DTW) to measure the similarity between the generated facial cue sequences and the ground truth. Specifically, for head pose, we separately calculate the DTW for the pitch, yaw, and roll sequences. For gaze, we calculate the DTW for the pupil movement speed sequences.
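For readers unfamiliar with DTW, the following is a minimal sketch of the classic dynamic-programming formulation on 1D sequences, together with a helper that derives a pupil speed sequence from per-frame pupil positions. The absolute difference as local cost and the helper names are assumptions; the paper does not state which DTW variant or local distance it uses.

```python
import math


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with absolute difference as local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def pupil_speed(positions):
    """Per-frame pupil displacement magnitude from a list of (x, y) positions."""
    return [math.dist(p, q) for p, q in zip(positions, positions[1:])]


# A sequence and a time-stretched copy align perfectly under DTW.
stretched = dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
offset = dtw_distance([0.0, 0.0], [1.0, 1.0])
speeds = pupil_speed([(0.0, 0.0), (3.0, 4.0), (3.0, 4.0)])
```

The first example shows why DTW suits this evaluation: a generated head motion that matches the ground truth in shape but not in exact timing is not penalized the way a frame-by-frame difference would penalize it.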