Cross-modal Face- and Voice-style Transfer

Takahashi, Naoya; Singh, Mayank K.; Mitsufuji, Yuki

arXiv:2302.13838 (cs)

[Submitted on 27 Feb 2023 (v1), last revised 1 Mar 2023 (this version, v2)]

Abstract: Image-to-image translation and voice conversion generate a new facial image or voice while preserving some of the semantics, such as the pose in an image and the linguistic content in audio, respectively. They can aid the content-creation process in many applications. However, because each is limited to conversion within its own modality, matching the impression of the generated face and voice remains an open question. We propose a cross-modal style transfer framework called XFaVoT that jointly learns four tasks within a single framework: image translation and voice conversion, each with audio or image guidance. This covers the cross-modal tasks, which enable the generation of a "face that matches a given voice" and a "voice that matches a given face", as well as the intra-modality translation tasks. Experimental results on multiple datasets show that XFaVoT achieves cross-modal style translation of image and voice, outperforming baselines in terms of quality, diversity, and face-voice correspondence.
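
The four jointly learned tasks can be read as every pairing of a translation target (image or voice) with a guidance modality (image or audio). The following is a minimal sketch of that enumeration only, not the authors' implementation; the names `TARGETS`, `GUIDANCE`, and `enumerate_tasks` are hypothetical and do not come from the paper.

```python
from itertools import product

# Hypothetical labels: the abstract names the modalities, not these identifiers.
TARGETS = {"image": "image translation", "voice": "voice conversion"}
GUIDANCE = ("image", "audio")

# Target/guidance pairs that stay within one modality (assumed reading of the abstract).
INTRA_MODALITY = {("image", "image"), ("voice", "audio")}


def enumerate_tasks():
    """Return (task name, guidance modality, kind) for each target/guidance pair."""
    tasks = []
    for (target, name), guide in product(TARGETS.items(), GUIDANCE):
        kind = "intra-modality" if (target, guide) in INTRA_MODALITY else "cross-modal"
        tasks.append((name, guide, kind))
    return tasks
```

Calling `enumerate_tasks()` yields four entries: two cross-modal (image translation with audio guidance, voice conversion with image guidance) and two intra-modality.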

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2302.13838 [cs.CV]
	(or arXiv:2302.13838v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2302.13838

Submission history

From: Naoya Takahashi
[v1] Mon, 27 Feb 2023 14:39:50 UTC (1,511 KB)
[v2] Wed, 1 Mar 2023 14:50:41 UTC (1,511 KB)
