Showing 1–8 of 8 results for author: Gabrys, A

  1. arXiv:2402.01385  [pdf]

    eess.AS cs.SD

    Del Visual al Auditivo: Sonorización de Escenas Guiada por Imagen (From the Visual to the Auditory: Image-Guided Scene Sonorization)

    Authors: María Sánchez, Laura Fernández, Julián Arias, Mateo Cámara, Giulia Comini, Adam Gabrys, José Luis Blanco, Juan Ignacio Godino, Luis Alfonso Hernández

    Abstract: Recent advances in image, video, text and audio generative techniques, and their use by the general public, are leading to new forms of content generation. Usually, each modality was approached separately, which poses limitations. The automatic sound recording of visual sequences is one of the greatest challenges for the automatic generation of multimodal content. We present a processing flow that…

    Submitted 2 February, 2024; originally announced February 2024.

    Comments: 10 pages, in Spanish, Tecniacústica

  2. arXiv:2207.14607  [pdf, other]

    eess.AS cs.SD

    Low-data? No problem: low-resource, language-agnostic conversational text-to-speech via F0-conditioned data augmentation

    Authors: Giulia Comini, Goeric Huybrechts, Manuel Sam Ribeiro, Adam Gabrys, Jaime Lorenzo-Trueba

    Abstract: The availability of data in expressive styles across languages is limited, and recording sessions are costly and time consuming. To overcome these issues, we demonstrate how to build low-resource, neural text-to-speech (TTS) voices with only 1 hour of conversational speech, when no other conversational data are available in the same language. Assuming the availability of non-expressive speech data…

    Submitted 29 July, 2022; originally announced July 2022.

    Comments: Accepted for presentation at Interspeech 2022

  3. arXiv:2202.08164  [pdf, other]

    eess.AS cs.CL cs.LG

    Voice Filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module

    Authors: Adam Gabryś, Goeric Huybrechts, Manuel Sam Ribeiro, Chung-Ming Chien, Julian Roth, Giulia Comini, Roberto Barra-Chicote, Bartek Perz, Jaime Lorenzo-Trueba

    Abstract: State-of-the-art text-to-speech (TTS) systems require several hours of recorded speech data to generate high-quality synthetic speech. When using reduced amounts of training data, standard TTS models suffer from speech quality and intelligibility degradations, making training low-resource TTS systems problematic. In this paper, we propose a novel extremely low-resource TTS method called Voice Filt…

    Submitted 16 February, 2022; originally announced February 2022.

    Comments: Accepted at ICASSP 2022

  4. arXiv:2202.05083  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Cross-speaker style transfer for text-to-speech using data augmentation

    Authors: Manuel Sam Ribeiro, Julian Roth, Giulia Comini, Goeric Huybrechts, Adam Gabrys, Jaime Lorenzo-Trueba

    Abstract: We address the problem of cross-speaker style transfer for text-to-speech (TTS) using data augmentation via voice conversion. We assume to have a corpus of neutral non-expressive data from a target speaker and supporting conversational expressive data from different speakers. Our goal is to build a TTS system that is expressive, while retaining the target speaker's identity. The proposed approach…

    Submitted 10 February, 2022; originally announced February 2022.

    Comments: 5 pages, 3 figures, 4 tables. ICASSP 2022

  5. arXiv:2108.06270  [pdf, other]

    eess.AS cs.AI

    Enhancing audio quality for expressive Neural Text-to-Speech

    Authors: Abdelhamid Ezzerg, Adam Gabrys, Bartosz Putrycz, Daniel Korzekwa, Daniel Saez-Trigueros, David McHardy, Kamil Pokora, Jakub Lachowicz, Jaime Lorenzo-Trueba, Viacheslav Klimkov

    Abstract: Artificial speech synthesis has made a great leap in terms of naturalness as recent Text-to-Speech (TTS) systems are capable of producing speech with similar quality to human recordings. However, not all speaking styles are easy to model: highly expressive voices are still challenging even to recent TTS architectures since there seems to be a trade-off between expressiveness in a generated audio a…

    Submitted 13 August, 2021; originally announced August 2021.

    Comments: 6 pages, 4 figures, 2 tables, SSW 2021

  6. Improving the expressiveness of neural vocoding with non-affine Normalizing Flows

    Authors: Adam Gabryś, Yunlong Jiao, Viacheslav Klimkov, Daniel Korzekwa, Roberto Barra-Chicote

    Abstract: This paper proposes a general enhancement to the Normalizing Flows (NF) used in neural vocoding. As a case study, we improve expressive speech vocoding with a revamped Parallel Wavenet (PW). Specifically, we propose to extend the affine transformation of PW to the more expressive invertible non-affine function. The greater expressiveness of the improved PW leads to better-perceived signal quality…

    Submitted 16 June, 2021; originally announced June 2021.

    Comments: Accepted to Interspeech 2021, 5 pages, 3 figures

  7. arXiv:2102.01106  [pdf, other]

    eess.AS cs.CL cs.SD

    Universal Neural Vocoding with Parallel WaveNet

    Authors: Yunlong Jiao, Adam Gabrys, Georgi Tinchev, Bartosz Putrycz, Daniel Korzekwa, Viacheslav Klimkov

    Abstract: We present a universal neural vocoder based on Parallel WaveNet, with an additional conditioning network called Audio Encoder. Our universal vocoder offers real-time high-quality speech synthesis on a wide range of use cases. We tested it on 43 internal speakers of diverse age and gender, speaking 20 languages in 17 unique styles, of which 7 voices and 5 styles were not exposed during training. We…

    Submitted 15 February, 2021; v1 submitted 1 February, 2021; originally announced February 2021.

    Comments: 5 pages, 2 figures. Accepted to ICASSP 2021

  8. arXiv:2012.09703  [pdf, other]

    eess.AS cs.SD

    Parallel WaveNet conditioned on VAE latent vectors

    Authors: Jonas Rohnke, Tom Merritt, Jaime Lorenzo-Trueba, Adam Gabrys, Vatsal Aggarwal, Alexis Moinet, Roberto Barra-Chicote

    Abstract: Recently the state-of-the-art text-to-speech synthesis systems have shifted to a two-model approach: a sequence-to-sequence model to predict a representation of speech (typically mel-spectrograms), followed by a 'neural vocoder' model which produces the time-domain speech waveform from this intermediate speech representation. This approach is capable of synthesizing speech that is confusable with…

    Submitted 17 December, 2020; originally announced December 2020.