-
XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model
Authors:
Edresson Casanova,
Kelly Davis,
Eren Gölge,
Görkem Göknar,
Iulian Gulea,
Logan Hart,
Aya Aljafari,
Joshua Meyer,
Reuben Morais,
Samuel Olayemi,
Julian Weber
Abstract:
Most Zero-shot Multi-speaker TTS (ZS-TTS) systems support only a single language. Although models like YourTTS, VALL-E X, Mega-TTS 2, and Voicebox explored Multilingual ZS-TTS, they are limited to just a few high/medium resource languages, limiting the applications of these models in most of the low/medium resource languages. In this paper, we aim to alleviate this issue by proposing and making publicly available the XTTS system. Our method builds upon the Tortoise model and adds several novel modifications to enable multilingual training, improve voice cloning, and enable faster training and inference. XTTS was trained in 16 languages and achieved state-of-the-art (SOTA) results in most of them.
Submitted 7 June, 2024;
originally announced June 2024.
-
MLAAD: The Multi-Language Audio Anti-Spoofing Dataset
Authors:
Nicolas M. Müller,
Piotr Kawa,
Wei Herng Choong,
Edresson Casanova,
Eren Gölge,
Thorsten Müller,
Piotr Syga,
Philip Sperl,
Konstantin Böttinger
Abstract:
Text-to-Speech (TTS) technology brings significant advantages, such as giving a voice to those with speech impairments, but also enables audio deepfakes and spoofs. The former mislead individuals and may propagate misinformation, while the latter undermine voice biometric security systems. AI-based detection can help to address these challenges by automatically differentiating between genuine and fabricated voice recordings. However, these models are only as good as their training data, which currently is severely limited due to an overwhelming concentration on English and Chinese audio in anti-spoofing databases, thus restricting its worldwide effectiveness. In response, this paper presents the Multi-Language Audio Anti-Spoof Dataset (MLAAD), created using 54 TTS models, comprising 21 different architectures, to generate 163.9 hours of synthetic voice in 23 different languages. We train and evaluate three state-of-the-art deepfake detection models with MLAAD, and observe that MLAAD demonstrates superior performance over comparable datasets like InTheWild or FakeOrReal when used as a training resource. Furthermore, in comparison with the renowned ASVspoof 2019 dataset, MLAAD proves to be a complementary resource. In tests across eight datasets, MLAAD and ASVspoof 2019 alternately outperformed each other, both excelling on four datasets. By publishing MLAAD and making trained models accessible via an interactive webserver, we aim to democratize antispoofing technology, making it accessible beyond the realm of specialists, thus contributing to global efforts against audio spoofing and deepfakes.
Submitted 16 April, 2024; v1 submitted 17 January, 2024;
originally announced January 2024.
-
YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone
Authors:
Edresson Casanova,
Julian Weber,
Christopher Shulby,
Arnaldo Candido Junior,
Eren Gölge,
Moacir Antonelli Ponti
Abstract:
YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieved state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset. Additionally, our approach achieves promising results in a target language with a single-speaker dataset, opening possibilities for zero-shot multi-speaker TTS and zero-shot voice conversion systems in low-resource languages. Finally, it is possible to fine-tune the YourTTS model with less than 1 minute of speech and achieve state-of-the-art results in voice similarity and with reasonable quality. This is important to allow synthesis for speakers with a very different voice or recording characteristics from those seen during training.
Submitted 30 April, 2023; v1 submitted 4 December, 2021;
originally announced December 2021.
-
SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model
Authors:
Edresson Casanova,
Christopher Shulby,
Eren Gölge,
Nicolas Michael Müller,
Frederico Santos de Oliveira,
Arnaldo Candido Junior,
Anderson da Silva Soares,
Sandra Maria Aluisio,
Moacir Antonelli Ponti
Abstract:
In this paper, we propose SC-GlowTTS: an efficient zero-shot multi-speaker text-to-speech model that improves similarity for speakers unseen during training. We propose a speaker-conditional architecture that explores a flow-based decoder that works in a zero-shot scenario. As text encoders, we explore a dilated residual convolutional-based encoder, gated convolutional-based encoder, and transformer-based encoder. Additionally, we have shown that adjusting a GAN-based vocoder for the spectrograms predicted by the TTS model on the training dataset can significantly improve the similarity and speech quality for new speakers. Our model converges using only 11 speakers, reaching state-of-the-art results for similarity with new speakers, as well as high speech quality.
Submitted 15 June, 2021; v1 submitted 2 April, 2021;
originally announced April 2021.
-
FAME: Face Association through Model Evolution
Authors:
Eren Golge,
Pinar Duygulu
Abstract:
We attack the problem of learning face models for public faces from weakly-labelled images collected from the web through querying a name. The data is very noisy even after face detection, with several irrelevant faces corresponding to other people. We propose a novel method, Face Association through Model Evolution (FAME), that is able to prune the data in an iterative way, for the face models associated with a name to evolve. The idea is based on capturing discriminativeness and representativeness of each instance and eliminating the outliers. The final models are used to classify faces on novel datasets with possibly different characteristics. On benchmark datasets, our results are comparable to or better than state-of-the-art studies for the task of face identification.
Submitted 10 July, 2014;
originally announced July 2014.
-
ConceptVision: A Flexible Scene Classification Framework
Authors:
Ahmet Iscen,
Eren Golge,
Ilker Sarac,
Pinar Duygulu
Abstract:
We introduce ConceptVision, a method that aims for high accuracy in categorizing a large number of scenes, while keeping the model relatively simple and efficient for scalability. The proposed method combines the advantages of both low-level representations and high-level semantic categories, and eliminates the distinctions between different levels through the definition of concepts. The proposed framework encodes the perspectives brought through different concepts by considering them in concept groups. Different perspectives are ensembled for the final decision. Extensive experiments are carried out on benchmark datasets to test the effects of different concepts, and methods used to ensemble. Comparisons with state-of-the-art studies show that we can achieve better results with incorporation of concepts in different levels with different perspectives.
Submitted 29 October, 2014; v1 submitted 3 January, 2014;
originally announced January 2014.
-
Rectifying Self Organizing Maps for Automatic Concept Learning from Web Images
Authors:
Eren Golge,
Pinar Duygulu
Abstract:
We attack the problem of learning concepts automatically from noisy web image search results. Going beyond low level attributes, such as colour and texture, we explore weakly-labelled datasets for the learning of higher level concepts, such as scene categories. The idea is based on discovering common characteristics shared among subsets of images by posing a method that is able to organise the data while eliminating irrelevant instances. We propose a novel clustering and outlier detection method, namely Rectifying Self Organizing Maps (RSOM). Given an image collection returned for a concept query, RSOM provides clusters pruned from outliers. Each cluster is used to train a model representing a different characteristic of the concept. The proposed method outperforms the state-of-the-art studies on the task of learning low-level concepts, and it is competitive in learning higher level concepts as well. It is capable of working at large scale with no supervision through exploiting the available sources.
Submitted 16 December, 2013;
originally announced December 2013.