Showing 1–11 of 11 results for author: Bonafonte, A

  1. arXiv:2307.07062  [pdf, other]

    eess.AS cs.LG cs.SD

    Controllable Emphasis with zero data for text-to-speech

    Authors: Arnaud Joly, Marco Nicolis, Ekaterina Peterova, Alessandro Lombardi, Ammar Abbas, Arent van Korlaar, Aman Hussain, Parul Sharma, Alexis Moinet, Mateusz Lajszczak, Penny Karanasou, Antonio Bonafonte, Thomas Drugman, Elena Sokolova

    Abstract: We present a scalable method to produce high quality emphasis for text-to-speech (TTS) that does not require recordings or annotations. Many TTS models include a phoneme duration model. A simple but effective method to achieve emphasized speech consists in increasing the predicted duration of the emphasised word. We show that this is significantly better than spectrogram modification techniques im…

    Submitted 13 July, 2023; originally announced July 2023.

    Comments: In Proceedings of the 12th Speech Synthesis Workshop (SSW) 2023

  2. arXiv:2212.03398   

    eess.AS cs.CL cs.SD

    Analysis and Utilization of Entrainment on Acoustic and Emotion Features in User-agent Dialogue

    Authors: Daxin Tan, Nikos Kargas, David McHardy, Constantinos Papayiannis, Antonio Bonafonte, Marek Strelec, Jonas Rohnke, Agis Oikonomou Filandras, Trevor Wood

    Abstract: Entrainment is the phenomenon by which an interlocutor adapts their speaking style to align with their partner in conversations. It has been found in different dimensions as acoustic, prosodic, lexical or syntactic. In this work, we explore and utilize the entrainment phenomenon to improve spoken dialogue systems for voice assistants. We first examine the existence of the entrainment phenomenon in…

    Submitted 6 December, 2022; originally announced December 2022.

    Comments: This version has been removed by arXiv administrators because the submitter did not have the right to assign a license at the time of submission

  3. arXiv:2202.06409  [pdf, other]

    eess.AS cs.CL cs.LG

    Distribution augmentation for low-resource expressive text-to-speech

    Authors: Mateusz Lajszczak, Animesh Prasad, Arent van Korlaar, Bajibabu Bollepalli, Antonio Bonafonte, Arnaud Joly, Marco Nicolis, Alexis Moinet, Thomas Drugman, Trevor Wood, Elena Sokolova

    Abstract: This paper presents a novel data augmentation technique for text-to-speech (TTS), that allows to generate new (text, audio) training examples without requiring any additional data. Our goal is to increase diversity of text conditionings available during training. This helps to reduce overfitting, especially in low-resource settings. Our method relies on substituting text and audio fragments in a w…

    Submitted 19 February, 2022; v1 submitted 13 February, 2022; originally announced February 2022.

    Comments: ICASSP 2022: camera-ready

  4. arXiv:2110.12539  [pdf, other]

    cs.SD cs.LG eess.AS

    Discrete Acoustic Space for an Efficient Sampling in Neural Text-To-Speech

    Authors: Marek Strelec, Jonas Rohnke, Antonio Bonafonte, Mateusz Łajszczak, Trevor Wood

    Abstract: We present a Split Vector Quantized Variational Autoencoder (SVQ-VAE) architecture using a split vector quantizer for NTTS, as an enhancement to the well-known Variational Autoencoder (VAE) and Vector Quantized Variational Autoencoder (VQ-VAE) architectures. Compared to these previous architectures, our proposed model retains the benefits of using an utterance-level bottleneck, while keeping signi…

    Submitted 14 September, 2023; v1 submitted 24 October, 2021; originally announced October 2021.

    Comments: 5 pages, 5 figures, accepted at IberSPEECH 2022

  5. arXiv:1908.07226  [pdf, other]

    cs.CL cs.MM cs.SD eess.AS

    Prosodic Phrase Alignment for Machine Dubbing

    Authors: Alp Öktem, Mireia Farrús, Antonio Bonafonte

    Abstract: Dubbing is a type of audiovisual translation where dialogues are translated and enacted so that they give the impression that the media is in the target language. It requires a careful alignment of dubbed recordings with the lip movements of performers in order to achieve visual coherence. In this paper, we deal with the specific problem of prosodic phrase synchronization within the framework of m…

    Submitted 20 August, 2019; originally announced August 2019.

    Comments: Interspeech 2019 pre-print

  6. arXiv:1906.00733  [pdf, other]

    cs.SD eess.AS

    Problem-Agnostic Speech Embeddings for Multi-Speaker Text-to-Speech with SampleRNN

    Authors: David Álvarez, Santiago Pascual, Antonio Bonafonte

    Abstract: Text-to-speech (TTS) acoustic models map linguistic features into an acoustic representation out of which an audible waveform is generated. The latest and most natural TTS systems build a direct mapping between linguistic and waveform domains, like SampleRNN. This way, possible signal naturalness losses are avoided as intermediate acoustic representations are discarded. Another important dimension…

    Submitted 22 September, 2019; v1 submitted 3 June, 2019; originally announced June 2019.

    Comments: Published at 10th ISCA Speech Synthesis Workshop

  7. arXiv:1904.03418  [pdf, other]

    cs.SD cs.LG eess.AS

    Towards Generalized Speech Enhancement with Generative Adversarial Networks

    Authors: Santiago Pascual, Joan Serrà, Antonio Bonafonte

    Abstract: The speech enhancement task usually consists of removing additive noise or reverberation that partially mask spoken utterances, affecting their intelligibility. However, little attention is drawn to other, perhaps more aggressive signal distortions like clipping, chunk elimination, or frequency-band removal. Such distortions can have a large impact not only on intelligibility, but also on naturaln…

    Submitted 6 April, 2019; originally announced April 2019.

  8. arXiv:1904.03416  [pdf, other]

    cs.LG cs.SD eess.AS stat.ML

    Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks

    Authors: Santiago Pascual, Mirco Ravanelli, Joan Serrà, Antonio Bonafonte, Yoshua Bengio

    Abstract: Learning good representations without supervision is still an open issue in machine learning, and is particularly challenging for speech signals, which are often characterized by long sequences with a complex hierarchical structure. Some recent works, however, have shown that it is possible to derive useful speech representations by employing a self-supervised encoder-discriminator approach. This…

    Submitted 6 April, 2019; originally announced April 2019.

  9. arXiv:1808.10687  [pdf, other]

    cs.SD cs.LG eess.AS

    Whispered-to-voiced Alaryngeal Speech Conversion with Generative Adversarial Networks

    Authors: Santiago Pascual, Antonio Bonafonte, Joan Serrà, Jose A. Gonzalez

    Abstract: Most methods of voice restoration for patients suffering from aphonia either produce whispered or monotone speech. Apart from intelligibility, this type of speech lacks expressiveness and naturalness due to the absence of pitch (whispered speech) or artificial generation of it (monotone speech). Existing techniques to restore prosodic information typically combine a vocoder, which parameterises th…

    Submitted 5 November, 2018; v1 submitted 31 August, 2018; originally announced August 2018.

  10. arXiv:1808.10678  [pdf, other]

    cs.SD cs.LG eess.AS

    Self-Attention Linguistic-Acoustic Decoder

    Authors: Santiago Pascual, Antonio Bonafonte, Joan Serrà

    Abstract: The conversion from text to speech relies on the accurate mapping from linguistic to acoustic symbol sequences, for which current practice employs recurrent statistical models like recurrent neural networks. Despite the good performance of such models (in terms of low distortion in the generated speech), their recursive structure tends to make them slow to train and to sample from. In this work, w…

    Submitted 5 November, 2018; v1 submitted 31 August, 2018; originally announced August 2018.

  11. arXiv:1712.06340  [pdf, other]

    cs.SD cs.LG eess.AS

    Language and Noise Transfer in Speech Enhancement Generative Adversarial Network

    Authors: Santiago Pascual, Maruchan Park, Joan Serrà, Antonio Bonafonte, Kang-Hun Ahn

    Abstract: Speech enhancement deep learning systems usually require large amounts of training data to operate in broad conditions or real applications. This makes the adaptability of those systems into new, low resource environments an important topic. In this work, we present the results of adapting a speech enhancement generative adversarial network by finetuning the generator with small amounts of data. W…

    Submitted 18 December, 2017; originally announced December 2017.