Showing 1–3 of 3 results for author: Siuzdak, H

  1. arXiv:2306.00814  [pdf, other]

    cs.SD cs.LG eess.AS

    Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis

    Authors: Hubert Siuzdak

    Abstract: Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time-domain. While effective, this approach neglects the inductive bias offered by time-frequency representations, resulting in redundant and computationally intensive upsampling operations. Fourier-based time-frequency representation is an appealing alternative, aligning more a…

    Submitted 29 May, 2024; v1 submitted 1 June, 2023; originally announced June 2023.

  2. arXiv:2203.16930  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    WavThruVec: Latent speech representation as intermediate features for neural speech synthesis

    Authors: Hubert Siuzdak, Piotr Dura, Pol van Rijn, Nori Jacoby

    Abstract: Recent advances in neural text-to-speech research have been dominated by two-stage pipelines utilizing low-level intermediate speech representation such as mel-spectrograms. However, such predetermined features are fundamentally limited, because they do not allow to exploit the full potential of a data-driven approach through learning hidden representations. For this reason, several end-to-end met…

    Submitted 11 July, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

    Comments: Accepted to INTERSPEECH 2022. Audio samples are available at: https://charactr-platform.github.io/WavThruVec/

  3. arXiv:2203.15379  [pdf, other]

    cs.SD cs.HC eess.AS

    VoiceMe: Personalized voice generation in TTS

    Authors: Pol van Rijn, Silvan Mertes, Dominik Schiller, Piotr Dura, Hubert Siuzdak, Peter M. C. Harrison, Elisabeth André, Nori Jacoby

    Abstract: Novel text-to-speech systems can generate entirely new voices that were not seen during training. However, it remains a difficult task to efficiently create personalized voices from a high-dimensional speaker space. In this work, we use speaker embeddings from a state-of-the-art speaker verification model (SpeakerNet) trained on thousands of speakers to condition a TTS model. We employ a human sam…

    Submitted 11 July, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

    Comments: Accepted to Interspeech'22. Audio and video samples are available at: https://polvanrijn.github.io/VoiceMe/