Search | arXiv e-print repository

XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model

Authors: Edresson Casanova, Kelly Davis, Eren Gölge, Görkem Göknar, Iulian Gulea, Logan Hart, Aya Aljafari, Joshua Meyer, Reuben Morais, Samuel Olayemi, Julian Weber

Abstract: Most Zero-shot Multi-speaker TTS (ZS-TTS) systems support only a single language. Although models like YourTTS, VALL-E X, Mega-TTS 2, and Voicebox explored Multilingual ZS-TTS they are limited to just a few high/medium resource languages, limiting the applications of these models in most of the low/medium resource languages. In this paper, we aim to alleviate this issue by proposing and making pub… ▽ More Most Zero-shot Multi-speaker TTS (ZS-TTS) systems support only a single language. Although models like YourTTS, VALL-E X, Mega-TTS 2, and Voicebox explored Multilingual ZS-TTS they are limited to just a few high/medium resource languages, limiting the applications of these models in most of the low/medium resource languages. In this paper, we aim to alleviate this issue by proposing and making publicly available the XTTS system. Our method builds upon the Tortoise model and adds several novel modifications to enable multilingual training, improve voice cloning, and enable faster training and inference. XTTS was trained in 16 languages and achieved state-of-the-art (SOTA) results in most of them. △ Less

Submitted 7 June, 2024; originally announced June 2024.

Comments: Accepted at INTERSPEECH 2024

arXiv:2401.09512 [pdf, other]

MLAAD: The Multi-Language Audio Anti-Spoofing Dataset

Authors: Nicolas M. Müller, Piotr Kawa, Wei Herng Choong, Edresson Casanova, Eren Gölge, Thorsten Müller, Piotr Syga, Philip Sperl, Konstantin Böttinger

Abstract: Text-to-Speech (TTS) technology brings significant advantages, such as giving a voice to those with speech impairments, but also enables audio deepfakes and spoofs. The former mislead individuals and may propagate misinformation, while the latter undermine voice biometric security systems. AI-based detection can help to address these challenges by automatically differentiating between genuine and… ▽ More Text-to-Speech (TTS) technology brings significant advantages, such as giving a voice to those with speech impairments, but also enables audio deepfakes and spoofs. The former mislead individuals and may propagate misinformation, while the latter undermine voice biometric security systems. AI-based detection can help to address these challenges by automatically differentiating between genuine and fabricated voice recordings. However, these models are only as good as their training data, which currently is severely limited due to an overwhelming concentration on English and Chinese audio in anti-spoofing databases, thus restricting its worldwide effectiveness. In response, this paper presents the Multi-Language Audio Anti-Spoof Dataset (MLAAD), created using 54 TTS models, comprising 21 different architectures, to generate 163.9 hours of synthetic voice in 23 different languages. We train and evaluate three state-of-the-art deepfake detection models with MLAAD, and observe that MLAAD demonstrates superior performance over comparable datasets like InTheWild or FakeOrReal when used as a training resource. Furthermore, in comparison with the renowned ASVspoof 2019 dataset, MLAAD proves to be a complementary resource. In tests across eight datasets, MLAAD and ASVspoof 2019 alternately outperformed each other, both excelling on four datasets. By publishing MLAAD and making trained models accessible via an interactive webserver , we aim to democratize antispoofing technology, making it accessible beyond the realm of specialists, thus contributing to global efforts against audio spoofing and deepfakes. △ Less

Submitted 16 April, 2024; v1 submitted 17 January, 2024; originally announced January 2024.

Comments: IJCNN 2024

arXiv:2112.02418 [pdf, other]

YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone

Authors: Edresson Casanova, Julian Weber, Christopher Shulby, Arnaldo Candido Junior, Eren Gölge, Moacir Antonelli Ponti

Abstract: YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieved state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset. Additionally, our… ▽ More YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieved state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset. Additionally, our approach achieves promising results in a target language with a single-speaker dataset, opening possibilities for zero-shot multi-speaker TTS and zero-shot voice conversion systems in low-resource languages. Finally, it is possible to fine-tune the YourTTS model with less than 1 minute of speech and achieve state-of-the-art results in voice similarity and with reasonable quality. This is important to allow synthesis for speakers with a very different voice or recording characteristics from those seen during training. △ Less

Submitted 30 April, 2023; v1 submitted 4 December, 2021; originally announced December 2021.

Comments: An Erratum was added on the last page of this paper

Journal ref: Proceedings of the 39th International Conference on Machine Learning, PMLR 162:2709-2720, 2022

arXiv:2104.05557 [pdf, other]

SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model

Authors: Edresson Casanova, Christopher Shulby, Eren Gölge, Nicolas Michael Müller, Frederico Santos de Oliveira, Arnaldo Candido Junior, Anderson da Silva Soares, Sandra Maria Aluisio, Moacir Antonelli Ponti

Abstract: In this paper, we propose SC-GlowTTS: an efficient zero-shot multi-speaker text-to-speech model that improves similarity for speakers unseen during training. We propose a speaker-conditional architecture that explores a flow-based decoder that works in a zero-shot scenario. As text encoders, we explore a dilated residual convolutional-based encoder, gated convolutional-based encoder, and transform… ▽ More In this paper, we propose SC-GlowTTS: an efficient zero-shot multi-speaker text-to-speech model that improves similarity for speakers unseen during training. We propose a speaker-conditional architecture that explores a flow-based decoder that works in a zero-shot scenario. As text encoders, we explore a dilated residual convolutional-based encoder, gated convolutional-based encoder, and transformer-based encoder. Additionally, we have shown that adjusting a GAN-based vocoder for the spectrograms predicted by the TTS model on the training dataset can significantly improve the similarity and speech quality for new speakers. Our model converges using only 11 speakers, reaching state-of-the-art results for similarity with new speakers, as well as high speech quality. △ Less

Submitted 15 June, 2021; v1 submitted 2 April, 2021; originally announced April 2021.

Comments: Accepted on Interspeech 2021

arXiv:1407.2987 [pdf, other]

FAME: Face Association through Model Evolution

Authors: Eren Golge, Pinar Duygulu

Abstract: We attack the problem of learning face models for public faces from weakly-labelled images collected from web through querying a name. The data is very noisy even after face detection, with several irrelevant faces corresponding to other people. We propose a novel method, Face Association through Model Evolution (FAME), that is able to prune the data in an iterative way, for the face models associ… ▽ More We attack the problem of learning face models for public faces from weakly-labelled images collected from web through querying a name. The data is very noisy even after face detection, with several irrelevant faces corresponding to other people. We propose a novel method, Face Association through Model Evolution (FAME), that is able to prune the data in an iterative way, for the face models associated to a name to evolve. The idea is based on capturing discriminativeness and representativeness of each instance and eliminating the outliers. The final models are used to classify faces on novel datasets with possibly different characteristics. On benchmark datasets, our results are comparable to or better than state-of-the-art studies for the task of face identification. △ Less

Submitted 10 July, 2014; originally announced July 2014.

Comments: Draft version of the study

arXiv:1401.0733 [pdf, other]

ConceptVision: A Flexible Scene Classification Framework

Authors: Ahmet Iscen, Eren Golge, Ilker Sarac, Pinar Duygulu

Abstract: We introduce ConceptVision, a method that aims for high accuracy in categorizing large number of scenes, while kee** the model relatively simpler and efficient for scalability. The proposed method combines the advantages of both low-level representations and high-level semantic categories, and eliminates the distinctions between different levels through the definition of concepts. The proposed f… ▽ More We introduce ConceptVision, a method that aims for high accuracy in categorizing large number of scenes, while kee** the model relatively simpler and efficient for scalability. The proposed method combines the advantages of both low-level representations and high-level semantic categories, and eliminates the distinctions between different levels through the definition of concepts. The proposed framework encodes the perspectives brought through different concepts by considering them in concept groups. Different perspectives are ensembled for the final decision. Extensive experiments are carried out on benchmark datasets to test the effects of different concepts, and methods used to ensemble. Comparisons with state-of-the-art studies show that we can achieve better results with incorporation of concepts in different levels with different perspectives. △ Less

Submitted 29 October, 2014; v1 submitted 3 January, 2014; originally announced January 2014.

arXiv:1312.4384 [pdf, other]

Rectifying Self Organizing Maps for Automatic Concept Learning from Web Images

Authors: Eren Golge, Pinar Duygulu

Abstract: We attack the problem of learning concepts automatically from noisy web image search results. Going beyond low level attributes, such as colour and texture, we explore weakly-labelled datasets for the learning of higher level concepts, such as scene categories. The idea is based on discovering common characteristics shared among subsets of images by posing a method that is able to organise the dat… ▽ More We attack the problem of learning concepts automatically from noisy web image search results. Going beyond low level attributes, such as colour and texture, we explore weakly-labelled datasets for the learning of higher level concepts, such as scene categories. The idea is based on discovering common characteristics shared among subsets of images by posing a method that is able to organise the data while eliminating irrelevant instances. We propose a novel clustering and outlier detection method, namely Rectifying Self Organizing Maps (RSOM). Given an image collection returned for a concept query, RSOM provides clusters pruned from outliers. Each cluster is used to train a model representing a different characteristics of the concept. The proposed method outperforms the state-of-the-art studies on the task of learning low-level concepts, and it is competitive in learning higher level concepts as well. It is capable to work at large scale with no supervision through exploiting the available sources. △ Less

Submitted 16 December, 2013; originally announced December 2013.

Comments: present CVPR2014 submission

Showing 1–7 of 7 results for author: Golge, E