Imitating Arbitrary Talking Style for Realistic Audio-Driven Talking Face Synthesis

Wu, Haozhe; Jia, Jia; Wang, Haoyu; Dou, Yishun; Duan, Chao; Deng, Qingshan

doi:10.1145/3474085.3475280

arXiv:2111.00203 (cs)

[Submitted on 30 Oct 2021]

Abstract: People talk in diverse styles. For the same piece of speech, different talking styles produce markedly different facial and head-pose movements. For example, a speaker in the "excited" style usually talks with the mouth wide open, while a speaker in the "solemn" style moves in a more standardized way and seldom exhibits exaggerated motions. Because styles differ this much, talking style needs to be incorporated into the audio-driven talking face synthesis framework. In this paper, we propose to inject style into the framework by imitating the arbitrary talking style of a given reference video. Specifically, we systematically investigate talking styles with our collected Ted-HD dataset and construct style codes as several statistics of 3D morphable model (3DMM) parameters. We then devise a latent-style-fusion (LSF) model that synthesizes stylized talking faces by imitating talking styles from the style codes. We emphasize the following novel characteristics of our framework: (1) It does not require any style annotation; the talking style is learned in an unsupervised manner from talking videos in the wild. (2) It can imitate arbitrary styles from arbitrary videos, and the style codes can also be interpolated to generate new styles. Extensive experiments demonstrate that the proposed framework synthesizes more natural and expressive talking styles than baseline methods.
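
The abstract describes style codes as "several statistics" of 3DMM parameters and notes that codes can be interpolated. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the statistics are the per-dimension mean and standard deviation over frames and that interpolation is linear; the abstract specifies neither. The function names `style_code` and `interpolate` and the input layout are hypothetical.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def style_code(params: Sequence[Sequence[float]]) -> List[float]:
    """Summarize a clip as per-dimension mean and standard deviation.

    `params` is a hypothetical T x D sequence: one row of D 3DMM-like
    coefficients (for example expression and pose) per video frame.
    The choice of mean and std is an assumption made for illustration.
    """
    if not params:
        raise ValueError("need at least one frame of parameters")
    columns = list(zip(*params))  # transpose into D columns of length T
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # population std over frames
    return means + stds  # code of length 2 * D


def interpolate(code_a: Sequence[float],
                code_b: Sequence[float],
                alpha: float) -> List[float]:
    """Linearly blend two style codes; alpha=0 gives code_a, alpha=1 gives code_b."""
    if len(code_a) != len(code_b):
        raise ValueError("style codes must have the same length")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(code_a, code_b)]


if __name__ == "__main__":
    # Two toy clips with D = 2 coefficients per frame (made-up numbers).
    calm = style_code([[0.1, 0.0], [0.1, 0.0], [0.1, 0.0]])
    lively = style_code([[0.0, 1.0], [2.0, 1.0]])
    print(interpolate(calm, lively, 0.5))
```

In a full system, a blended code like this would condition the synthesis model in place of a code taken from a single reference video.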

Comments:	Accepted by MM2021, code available at this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multimedia (cs.MM)
ACM classes:	I.1.4
Cite as:	arXiv:2111.00203 [cs.CV]
	(or arXiv:2111.00203v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2111.00203
Related DOI:	https://doi.org/10.1145/3474085.3475280

Submission history

From: Haozhe Wu
[v1] Sat, 30 Oct 2021 08:15:27 UTC (10,686 KB)
