High-fidelity Generalized Emotional Talking Face Generation with Multi-modal Emotion Space Learning

Xu, Chao; Zhu, Junwei; Zhang, Jiangning; Han, Yue; Chu, Wenqing; Tai, Ying; Wang, Chengjie; Xie, Zhifeng; Liu, Yong

arXiv:2305.02572 (cs)

[Submitted on 4 May 2023 (v1), last revised 31 May 2023 (this version, v2)]

Abstract: Recently, emotional talking face generation has received considerable attention. However, existing methods adopt only one-hot coding, image, or audio as emotion conditions, and thus lack flexible control in practical applications and fail to handle unseen emotion styles due to limited semantics. They also neglect either the one-shot setting or the quality of the generated faces. In this paper, we propose a more flexible and generalized framework. Specifically, we supplement the emotion style in text prompts and use an Aligned Multi-modal Emotion encoder to embed the text, image, and audio emotion modalities into a unified space, which inherits a rich semantic prior from CLIP. Consequently, effective multi-modal emotion space learning lets our method support an arbitrary emotion modality during testing and generalize to unseen emotion styles. In addition, an Emotion-aware Audio-to-3DMM Convertor is proposed to connect the emotion condition and the audio sequence to a structural representation. A subsequent style-based High-fidelity Emotional Face generator is designed to generate arbitrary high-resolution realistic identities. Our texture generator hierarchically learns flow fields and animated faces in a residual manner. Extensive experiments demonstrate the flexibility and generalization of our method in emotion control and its effectiveness in high-quality face synthesis.
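
As an illustration of what embedding several emotion modalities into one unified space can look like, the following is a minimal sketch and not the authors' implementation. The abstract does not state the encoder architecture or the training objective, so everything here is an assumption: the helper names (`project`, `symmetric_contrastive_loss`, `nearest_emotion`), the linear projections, the temperature value, and the use of a CLIP-style symmetric contrastive loss to pull paired embeddings together.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def project(features, weight):
    """Hypothetical per-modality head: linear map into the shared space,
    followed by unit normalization so dot products are cosine similarities."""
    return l2_normalize(features @ weight)


def symmetric_contrastive_loss(a, b, temperature=0.07):
    """CLIP-style symmetric contrastive loss (an assumption, not taken from
    the paper). Row i of `a` is treated as the positive pair of row i of `b`."""
    logits = (a @ b.T) / temperature

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


def nearest_emotion(query, text_bank, labels):
    """Return the label whose text embedding is most similar to `query`.
    Works for a query from any modality once it lives in the shared space."""
    sims = text_bank @ query
    return labels[int(np.argmax(sims))]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical raw feature sizes for each modality and a shared size of 16.
    text_feat = rng.normal(size=(4, 32))
    audio_feat = rng.normal(size=(4, 24))
    w_text = rng.normal(size=(32, 16))
    w_audio = rng.normal(size=(24, 16))

    text_emb = project(text_feat, w_text)
    audio_emb = project(audio_feat, w_audio)
    print("loss:", symmetric_contrastive_loss(text_emb, audio_emb))
```

Under these assumptions, once all modalities share one normalized space, an emotion condition supplied at test time as text, image, or audio reduces to the same kind of vector, which is the property the abstract relies on for flexible control.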

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2305.02572 [cs.CV]
	(or arXiv:2305.02572v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2305.02572

Submission history

From: Chao Xu
[v1] Thu, 4 May 2023 05:59:34 UTC (3,623 KB)
[v2] Wed, 31 May 2023 03:41:12 UTC (3,623 KB)
