Showing 1–31 of 31 results for author: Lorenzo-Trueba, J

  1. arXiv:2402.03407  [pdf, other]

    eess.AS cs.CL cs.LG

    Enhancing the Stability of LLM-based Speech Generation Systems through Self-Supervised Representations

    Authors: Álvaro Martín-Cortinas, Daniel Sáez-Trigueros, Iván Vallés-Pérez, Biel Tura-Vecino, Piotr Biliński, Mateusz Lajszczak, Grzegorz Beringer, Roberto Barra-Chicote, Jaime Lorenzo-Trueba

    Abstract: Large Language Models (LLMs) are one of the most promising technologies for the next era of speech generation systems, due to their scalability and in-context learning capabilities. Nevertheless, they suffer from multiple stability issues at inference time, such as hallucinations, content skipping or speech repetitions. In this work, we introduce a new self-supervised Voice Conversion (VC) archite…

    Submitted 5 February, 2024; originally announced February 2024.

    Comments: 10 pages, 1 figure, 3 tables

  2. arXiv:2307.16709  [pdf, other]

    cs.CL eess.AS

    Multilingual context-based pronunciation learning for Text-to-Speech

    Authors: Giulia Comini, Manuel Sam Ribeiro, Fan Yang, Heereen Shim, Jaime Lorenzo-Trueba

    Abstract: Phonetic information and linguistic knowledge are an essential component of a Text-to-speech (TTS) front-end. Given a language, a lexicon can be collected offline and Grapheme-to-Phoneme (G2P) relationships are usually modeled in order to predict the pronunciation for out-of-vocabulary (OOV) words. Additionally, post-lexical phonology, often defined in the form of rule-based systems, is used to co…

    Submitted 31 July, 2023; originally announced July 2023.

    Comments: 5 pages, 2 figures, 5 tables. Interspeech 2023

  3. arXiv:2307.16679  [pdf, other]

    eess.AS cs.CL cs.LG

    Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech

    Authors: Guangyan Zhang, Thomas Merritt, Manuel Sam Ribeiro, Biel Tura-Vecino, Kayoko Yanagisawa, Kamil Pokora, Abdelhamid Ezzerg, Sebastian Cygert, Ammar Abbas, Piotr Bilinski, Roberto Barra-Chicote, Daniel Korzekwa, Jaime Lorenzo-Trueba

    Abstract: Neural text-to-speech systems are often optimized on L1/L2 losses, which make strong assumptions about the distributions of the target data space. Aiming to improve those assumptions, Normalizing Flows and Diffusion Probabilistic Models were recently proposed as alternatives. In this paper, we compare traditional L1/L2-based approaches to diffusion and flow-based approaches for the tasks of prosod…

    Submitted 31 July, 2023; originally announced July 2023.

    Comments: 5 pages, 2 figures, 5 tables. Interspeech 2023

  4. arXiv:2307.16643  [pdf, other]

    eess.AS cs.CL

    Improving grapheme-to-phoneme conversion by learning pronunciations from speech recordings

    Authors: Manuel Sam Ribeiro, Giulia Comini, Jaime Lorenzo-Trueba

    Abstract: The Grapheme-to-Phoneme (G2P) task aims to convert orthographic input into a discrete phonetic representation. G2P conversion is beneficial to various speech processing applications, such as text-to-speech and speech recognition. However, these tend to rely on manually-annotated pronunciation dictionaries, which are often time-consuming and costly to acquire. In this paper, we propose a method to…

    Submitted 31 July, 2023; originally announced July 2023.

    Comments: 5 pages, 2 figures, 4 tables. Interspeech 2023

  5. arXiv:2207.14607  [pdf, other]

    eess.AS cs.SD

    Low-data? No problem: low-resource, language-agnostic conversational text-to-speech via F0-conditioned data augmentation

    Authors: Giulia Comini, Goeric Huybrechts, Manuel Sam Ribeiro, Adam Gabrys, Jaime Lorenzo-Trueba

    Abstract: The availability of data in expressive styles across languages is limited, and recording sessions are costly and time consuming. To overcome these issues, we demonstrate how to build low-resource, neural text-to-speech (TTS) voices with only 1 hour of conversational speech, when no other conversational data are available in the same language. Assuming the availability of non-expressive speech data…

    Submitted 29 July, 2022; originally announced July 2022.

    Comments: Accepted for presentation at Interspeech 2022

  6. Computer-assisted Pronunciation Training -- Speech synthesis is almost all you need

    Authors: Daniel Korzekwa, Jaime Lorenzo-Trueba, Thomas Drugman, Bozena Kostek

    Abstract: The research community has long studied computer-assisted pronunciation training (CAPT) methods in non-native speech. Researchers focused on studying various model architectures, such as Bayesian networks and deep learning methods, as well as on the analysis of different representations of the speech signal. Despite significant progress in recent years, existing CAPT methods are not able to detect…

    Submitted 2 July, 2022; originally announced July 2022.

    Comments: Published in Speech Communication Journal

  7. arXiv:2202.08164  [pdf, other]

    eess.AS cs.CL cs.LG

    Voice Filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module

    Authors: Adam Gabryś, Goeric Huybrechts, Manuel Sam Ribeiro, Chung-Ming Chien, Julian Roth, Giulia Comini, Roberto Barra-Chicote, Bartek Perz, Jaime Lorenzo-Trueba

    Abstract: State-of-the-art text-to-speech (TTS) systems require several hours of recorded speech data to generate high-quality synthetic speech. When using reduced amounts of training data, standard TTS models suffer from speech quality and intelligibility degradations, making training low-resource TTS systems problematic. In this paper, we propose a novel extremely low-resource TTS method called Voice Filt…

    Submitted 16 February, 2022; originally announced February 2022.

    Comments: Accepted at ICASSP 2022

  8. arXiv:2202.05083  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Cross-speaker style transfer for text-to-speech using data augmentation

    Authors: Manuel Sam Ribeiro, Julian Roth, Giulia Comini, Goeric Huybrechts, Adam Gabrys, Jaime Lorenzo-Trueba

    Abstract: We address the problem of cross-speaker style transfer for text-to-speech (TTS) using data augmentation via voice conversion. We assume to have a corpus of neutral non-expressive data from a target speaker and supporting conversational expressive data from different speakers. Our goal is to build a TTS system that is expressive, while retaining the target speaker's identity. The proposed approach…

    Submitted 10 February, 2022; originally announced February 2022.

    Comments: 5 pages, 3 figures, 4 tables. ICASSP 2022

  9. arXiv:2108.06270  [pdf, other]

    eess.AS cs.AI

    Enhancing audio quality for expressive Neural Text-to-Speech

    Authors: Abdelhamid Ezzerg, Adam Gabrys, Bartosz Putrycz, Daniel Korzekwa, Daniel Saez-Trigueros, David McHardy, Kamil Pokora, Jakub Lachowicz, Jaime Lorenzo-Trueba, Viacheslav Klimkov

    Abstract: Artificial speech synthesis has made a great leap in terms of naturalness as recent Text-to-Speech (TTS) systems are capable of producing speech with similar quality to human recordings. However, not all speaking styles are easy to model: highly expressive voices are still challenging even to recent TTS architectures since there seems to be a trade-off between expressiveness in a generated audio a…

    Submitted 13 August, 2021; originally announced August 2021.

    Comments: 6 pages, 4 figures, 2 tables, SSW 2021

  10. arXiv:2106.08873  [pdf, other]

    cs.SD cs.LG eess.AS

    Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments

    Authors: Alejandro Mottini, Jaime Lorenzo-Trueba, Sri Vishnu Kumar Karlapati, Thomas Drugman

    Abstract: Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker. While there is a rich literature on VC, most proposed methods are trained and evaluated on clean speech recordings. However, many acoustic environments are noisy and reverberant, severely restricting the applicability of popular VC method…

    Submitted 16 June, 2021; originally announced June 2021.

    Comments: Presented at the Speech Synthesis Workshops 2021 (SSW11)

  11. arXiv:2106.03494  [pdf, other]

    eess.AS cs.LG

    Weakly-supervised word-level pronunciation error detection in non-native English speech

    Authors: Daniel Korzekwa, Jaime Lorenzo-Trueba, Thomas Drugman, Shira Calamaro, Bozena Kostek

    Abstract: We propose a weakly-supervised model for word-level mispronunciation detection in non-native (L2) English speech. To train this model, phonetically transcribed L2 speech is not required and we only need to mark mispronounced words. The lack of phonetic transcriptions for L2 speech means that the model has to learn only from a weak signal of word-level mispronunciations. Because of that and due to…

    Submitted 7 June, 2021; originally announced June 2021.

    Comments: Accepted to Interspeech 2021

  12. arXiv:2104.07777  [pdf, other]

    cs.CL

    Proteno: Text Normalization with Limited Data for Fast Deployment in Text to Speech Systems

    Authors: Shubhi Tyagi, Antonio Bonafonte, Jaime Lorenzo-Trueba, Javier Latorre

    Abstract: Developing Text Normalization (TN) systems for Text-to-Speech (TTS) on new languages is hard. We propose a novel architecture to facilitate it for multiple languages while using data less than 3% of the size of the data used by the state of the art results on English. We treat TN as a sequence classification problem and propose a granular tokenization mechanism that enables the system to learn maj…

    Submitted 15 April, 2021; originally announced April 2021.

    Comments: Accepted to NAACL 2021

  13. arXiv:2101.06396  [pdf, other]

    eess.AS cs.LG cs.SD

    Mispronunciation Detection in Non-native (L2) English with Uncertainty Modeling

    Authors: Daniel Korzekwa, Jaime Lorenzo-Trueba, Szymon Zaporowski, Shira Calamaro, Thomas Drugman, Bozena Kostek

    Abstract: A common approach to the automatic detection of mispronunciation in language learning is to recognize the phonemes produced by a student and compare it to the expected pronunciation of a native speaker. This approach makes two simplifying assumptions: a) phonemes can be recognized from speech with high accuracy, b) there is a single correct way for a sentence to be pronounced. These assumptions do…

    Submitted 8 February, 2021; v1 submitted 16 January, 2021; originally announced January 2021.

    Comments: Accepted to ICASSP 2021

  14. arXiv:2101.05695  [pdf, other]

    eess.AS cs.SD

    EmoCat: Language-agnostic Emotional Voice Conversion

    Authors: Bastian Schnell, Goeric Huybrechts, Bartek Perz, Thomas Drugman, Jaime Lorenzo-Trueba

    Abstract: Emotional voice conversion models adapt the emotion in speech without changing the speaker identity or linguistic content. They are less data hungry than text-to-speech models and allow to generate large amounts of emotional data for downstream tasks. In this work we propose EmoCat, a language-agnostic emotional voice conversion model. It achieves high-quality emotion conversion in German with les…

    Submitted 14 January, 2021; originally announced January 2021.

    Comments: Submitted to IEEE ICASSP 2021

  15. arXiv:2012.14788  [pdf, other]

    eess.AS cs.SD

    Detection of Lexical Stress Errors in Non-Native (L2) English with Data Augmentation and Attention

    Authors: Daniel Korzekwa, Roberto Barra-Chicote, Szymon Zaporowski, Grzegorz Beringer, Jaime Lorenzo-Trueba, Alicja Serafinowicz, Jasha Droppo, Thomas Drugman, Bozena Kostek

    Abstract: This paper describes two novel complementary techniques that improve the detection of lexical stress errors in non-native (L2) English speech: attention-based feature extraction and data augmentation based on Neural Text-To-Speech (TTS). In a classical approach, audio features are usually extracted from fixed regions of speech such as the syllable nucleus. We propose an attention-based deep learni…

    Submitted 7 June, 2021; v1 submitted 29 December, 2020; originally announced December 2020.

    Comments: Accepted to Interspeech 2021

  16. arXiv:2012.09703  [pdf, other]

    eess.AS cs.SD

    Parallel WaveNet conditioned on VAE latent vectors

    Authors: Jonas Rohnke, Tom Merritt, Jaime Lorenzo-Trueba, Adam Gabrys, Vatsal Aggarwal, Alexis Moinet, Roberto Barra-Chicote

    Abstract: Recently the state-of-the-art text-to-speech synthesis systems have shifted to a two-model approach: a sequence-to-sequence model to predict a representation of speech (typically mel-spectrograms), followed by a 'neural vocoder' model which produces the time-domain speech waveform from this intermediate speech representation. This approach is capable of synthesizing speech that is confusable with…

    Submitted 17 December, 2020; originally announced December 2020.

  17. arXiv:2011.05707  [pdf, other]

    eess.AS cs.CL cs.SD

    Low-resource expressive text-to-speech using data augmentation

    Authors: Goeric Huybrechts, Thomas Merritt, Giulia Comini, Bartek Perz, Raahil Shah, Jaime Lorenzo-Trueba

    Abstract: While recent neural text-to-speech (TTS) systems perform remarkably well, they typically require a substantial amount of recordings from the target speaker reading in the desired speaking style. In this work, we present a novel 3-step methodology to circumvent the costly operation of recording large amounts of target data in order to build expressive style voices with as little as 15 minutes of su…

    Submitted 1 June, 2021; v1 submitted 11 November, 2020; originally announced November 2020.

  18. arXiv:2011.01175  [pdf, other]

    eess.AS

    CAMP: a Two-Stage Approach to Modelling Prosody in Context

    Authors: Zack Hodari, Alexis Moinet, Sri Karlapati, Jaime Lorenzo-Trueba, Thomas Merritt, Arnaud Joly, Ammar Abbas, Penny Karanasou, Thomas Drugman

    Abstract: Prosody is an integral part of communication, but remains an open problem in state-of-the-art speech synthesis. There are two major issues faced when modelling prosody: (1) prosody varies at a slower rate compared with other content in the acoustic signal (e.g. segmental information and background noise); (2) determining appropriate prosody without sufficient context is an ill-posed problem. In th…

    Submitted 12 February, 2021; v1 submitted 2 November, 2020; originally announced November 2020.

    Comments: 5 pages. Published in the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021)

  19. arXiv:1912.05289  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    Voice Conversion for Whispered Speech Synthesis

    Authors: Marius Cotescu, Thomas Drugman, Goeric Huybrechts, Jaime Lorenzo-Trueba, Alexis Moinet

    Abstract: We present an approach to synthesize whisper by applying a handcrafted signal processing recipe and Voice Conversion (VC) techniques to convert normally phonated speech to whispered speech. We investigate using Gaussian Mixture Models (GMM) and Deep Neural Networks (DNN) to model the mapping between acoustic features of normal speech and those of whispered speech. We evaluate naturalness and speak…

    Submitted 17 January, 2020; v1 submitted 11 December, 2019; originally announced December 2019.

    Comments: Submitted to IEEE Signal Processing Letters

  20. Dynamic Prosody Generation for Speech Synthesis using Linguistics-Driven Acoustic Embedding Selection

    Authors: Shubhi Tyagi, Marco Nicolis, Jonas Rohnke, Thomas Drugman, Jaime Lorenzo-Trueba

    Abstract: Recent advances in Text-to-Speech (TTS) have improved quality and naturalness to near-human capabilities when considering isolated sentences. But something which is still lacking in order to achieve human-like communication is the dynamic variations and adaptability of human speech. This work attempts to solve the problem of achieving a more dynamic and natural intonation in TTS systems, particula…

    Submitted 18 November, 2020; v1 submitted 2 December, 2019; originally announced December 2019.

    Journal ref: INTERSPEECH 2020: 4407-4411

  21. arXiv:1911.12760  [pdf, other]

    cs.LG cs.CL cs.SD eess.AS stat.ML

    Using VAEs and Normalizing Flows for One-shot Text-To-Speech Synthesis of Expressive Speech

    Authors: Vatsal Aggarwal, Marius Cotescu, Nishant Prateek, Jaime Lorenzo-Trueba, Roberto Barra-Chicote

    Abstract: We propose a Text-to-Speech method to create an unseen expressive style using one utterance of expressive speech of around one second. Specifically, we enhance the disentanglement capabilities of a state-of-the-art sequence-to-sequence based system with a Variational AutoEncoder (VAE) and a Householder Flow. The proposed system provides a 22% KL-divergence reduction while jointly improving percept…

    Submitted 17 February, 2020; v1 submitted 28 November, 2019; originally announced November 2019.

    Comments: Accepted to ICASSP 2020

  22. arXiv:1911.03952  [pdf, other]

    cs.SD eess.AS

    Transformation of low-quality device-recorded speech to high-quality speech using improved SEGAN model

    Authors: Seyyed Saeed Sarfjoo, Xin Wang, Gustav Eje Henter, Jaime Lorenzo-Trueba, Shinji Takaki, Junichi Yamagishi

    Abstract: Nowadays vast amounts of speech data are recorded from low-quality recorder devices such as smartphones, tablets, laptops, and medium-quality microphones. The objective of this research was to study the automatic generation of high-quality speech from such low-quality device-recorded speech, which could then be applied to many speech-generation tasks. In this paper, we first introduce our new devi…

    Submitted 20 November, 2019; v1 submitted 10 November, 2019; originally announced November 2019.

    Comments: This study was conducted during an internship of the first author at NII, Japan in 2017

  23. arXiv:1904.02790  [pdf, other]

    cs.CL cs.LG eess.AS

    In Other News: A Bi-style Text-to-speech Model for Synthesizing Newscaster Voice with Limited Data

    Authors: Nishant Prateek, Mateusz Łajszczak, Roberto Barra-Chicote, Thomas Drugman, Jaime Lorenzo-Trueba, Thomas Merritt, Srikanth Ronanki, Trevor Wood

    Abstract: Neural text-to-speech synthesis (NTTS) models have shown significant progress in generating high-quality speech, however they require a large quantity of training data. This makes creating models for multiple styles expensive and time-consuming. In this paper different styles of speech are analysed based on prosodic variations, from this a model is proposed to synthesise speech in the style of a n…

    Submitted 4 April, 2019; originally announced April 2019.

    Comments: Accepted at NAACL-HLT 2019

  24. arXiv:1811.06315  [pdf, other]

    cs.CL eess.AS

    Effect of data reduction on sequence-to-sequence neural TTS

    Authors: Javier Latorre, Jakub Lachowicz, Jaime Lorenzo-Trueba, Thomas Merritt, Thomas Drugman, Srikanth Ronanki, Klimkov Viacheslav

    Abstract: Recent speech synthesis systems based on sampling from autoregressive neural networks models can generate speech almost undistinguishable from human recordings. However, these models require large amounts of data. This paper shows that the lack of data from one speaker can be compensated with data from other speakers. The naturalness of Tacotron2-like models trained on a blend of 5k utterances fro…

    Submitted 23 November, 2018; v1 submitted 15 November, 2018; originally announced November 2018.

    Comments: 4 pages, 1 extra for references. Submitted to ICASSP 2019

  25. arXiv:1811.06292  [pdf, other]

    eess.AS cs.SD

    Towards achieving robust universal neural vocoding

    Authors: Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas Merritt, Bartosz Putrycz, Roberto Barra-Chicote, Alexis Moinet, Vatsal Aggarwal

    Abstract: This paper explores the potential universality of neural vocoders. We train a WaveRNN-based vocoder on 74 speakers coming from 17 languages. This vocoder is shown to be capable of generating speech of consistently good quality (98% relative mean MUSHRA when compared to natural speech) regardless of whether the input spectrogram comes from a speaker or style seen during training or from an out-of-d…

    Submitted 4 July, 2019; v1 submitted 15 November, 2018; originally announced November 2018.

    Comments: 4 pages, 1 extra for references. Accepted on Interspeech 2019

  26. arXiv:1807.11470  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Deep Encoder-Decoder Models for Unsupervised Learning of Controllable Speech Synthesis

    Authors: Gustav Eje Henter, Jaime Lorenzo-Trueba, Xin Wang, Junichi Yamagishi

    Abstract: Generating versatile and appropriate synthetic speech requires control over the output expression separate from the spoken text. Important non-textual speech variation is seldom annotated, in which case output control must be learned in an unsupervised fashion. In this paper, we perform an in-depth study of methods for unsupervised learning of control in statistical speech synthesis. For example,…

    Submitted 9 September, 2018; v1 submitted 30 July, 2018; originally announced July 2018.

    Comments: 17 pages, 4 figures

    MSC Class: 62F99 ACM Class: I.2.7; G.3

  27. arXiv:1804.08438  [pdf, other]

    eess.AS cs.CL cs.SD stat.ML

    A Spoofing Benchmark for the 2018 Voice Conversion Challenge: Leveraging from Spoofing Countermeasures for Speech Artifact Assessment

    Authors: Tomi Kinnunen, Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Zhenhua Ling

    Abstract: Voice conversion (VC) aims at conversion of speaker characteristic without altering content. Due to training data limitations and modeling imperfections, it is difficult to achieve believable speaker mimicry without introducing processing artifacts; performance assessment of VC, therefore, usually involves both speaker similarity and quality evaluation by a human panel. As a time-consuming, expens…

    Submitted 4 September, 2018; v1 submitted 23 April, 2018; originally announced April 2018.

    Comments: Correction (bug fix) of a published ODYSSEY 2018 publication with the same title and author list; more details in footnote in page 1

  28. arXiv:1804.04262  [pdf, other]

    eess.AS cs.CL cs.SD stat.ML

    The Voice Conversion Challenge 2018: Promoting Development of Parallel and Nonparallel Methods

    Authors: Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Tomi Kinnunen, Zhenhua Ling

    Abstract: We present the Voice Conversion Challenge 2018, designed as a follow up to the 2016 edition with the aim of providing a common framework for evaluating and comparing different state-of-the-art voice conversion (VC) systems. The objective of the challenge was to perform speaker conversion (i.e. transform the vocal identity) of a source speaker to a target speaker while maintaining linguistic inform…

    Submitted 11 April, 2018; originally announced April 2018.

    Comments: Accepted for Speaker Odyssey 2018

  29. arXiv:1804.02549  [pdf, other]

    eess.AS cs.CL cs.SD stat.ML

    A comparison of recent waveform generation and acoustic modeling methods for neural-network-based speech synthesis

    Authors: Xin Wang, Jaime Lorenzo-Trueba, Shinji Takaki, Lauri Juvela, Junichi Yamagishi

    Abstract: Recent advances in speech synthesis suggest that limitations such as the lossy nature of the amplitude spectrum with minimum phase approximation and the over-smoothing effect in acoustic modeling can be overcome by using advanced machine learning approaches. In this paper, we build a framework in which we can fairly compare new vocoding and acoustic modeling techniques with conventional approaches…

    Submitted 7 April, 2018; originally announced April 2018.

    Comments: To appear in ICASSP 2018

  30. arXiv:1804.00425  [pdf, other]

    eess.AS cs.CL cs.SD stat.ML

    High-quality nonparallel voice conversion based on cycle-consistent adversarial network

    Authors: Fuming Fang, Junichi Yamagishi, Isao Echizen, Jaime Lorenzo-Trueba

    Abstract: Although voice conversion (VC) algorithms have achieved remarkable success along with the development of machine learning, superior performance is still difficult to achieve when using nonparallel data. In this paper, we propose using a cycle-consistent adversarial network (CycleGAN) for nonparallel data-based VC training. A CycleGAN is a generative adversarial network (GAN) originally developed f…

    Submitted 2 April, 2018; originally announced April 2018.

    Comments: accepted at ICASSP 2018

  31. arXiv:1803.00860  [pdf, other]

    eess.AS cs.CL cs.SD stat.ML

    Can we steal your vocal identity from the Internet?: Initial investigation of cloning Obama's voice using GAN, WaveNet and low-quality found data

    Authors: Jaime Lorenzo-Trueba, Fuming Fang, Xin Wang, Isao Echizen, Junichi Yamagishi, Tomi Kinnunen

    Abstract: Thanks to the growing availability of spoofing databases and rapid advances in using them, systems for detecting voice spoofing attacks are becoming more and more capable, and error rates close to zero are being reached for the ASVspoof2015 database. However, speech synthesis and voice conversion paradigms that are not considered in the ASVspoof2015 database are appearing. Such examples include di…

    Submitted 2 March, 2018; originally announced March 2018.

    Comments: conference manuscript submitted to Speaker Odyssey 2018