Showing 1–13 of 13 results for author: Székely, É

Searching in archive eess.
  1. arXiv:2406.05401

    eess.AS cs.HC cs.SD

    Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech

    Authors: Shivam Mehta, Harm Lameris, Rajiv Punmiya, Jonas Beskow, Éva Székely, Gustav Eje Henter

    Abstract: Converting input symbols to output audio in TTS requires modelling the durations of speech sounds. Leading non-autoregressive (NAR) TTS models treat duration modelling as a regression problem. The same utterance is then spoken with identical timings every time, unlike when a human speaks. Probabilistic models of duration have been proposed, but there is mixed evidence of their benefits. However, p…

    Submitted 8 June, 2024; originally announced June 2024.

    Comments: 5 pages, 2 figures. Final version, accepted to Interspeech 2024

    MSC Class: 68T07 ACM Class: I.2.7; I.2.6; H.5.5

  2. arXiv:2405.09768

    eess.AS cs.SD

    Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model

    Authors: Siyang Wang, Éva Székely

    Abstract: Recent advances in generative language modeling applied to discrete speech tokens presented a new avenue for text-to-speech (TTS) synthesis. These speech language models (SLMs), similarly to their textual counterparts, are scalable, probabilistic, and context-aware. While they can produce diverse and natural outputs, they sometimes face issues such as unintelligibility and the inclusion of non-spe…

    Submitted 15 May, 2024; originally announced May 2024.

    Comments: 11 pages, 4 figures. Language Resources and Evaluation Conference (LREC) 2024. demo: https://swatsw.github.io/lrec24_eval_slm/

  3. arXiv:2310.05181

    eess.AS cs.GR cs.HC cs.LG cs.SD

    Unified speech and gesture synthesis using flow matching

    Authors: Shivam Mehta, Ruibo Tu, Simon Alexanderson, Jonas Beskow, Éva Székely, Gustav Eje Henter

    Abstract: As text-to-speech technologies achieve remarkable naturalness in read-aloud tasks, there is growing interest in multimodal synthesis of verbal and non-verbal communicative behaviour, such as spontaneous speech and associated body gestures. This paper presents a novel, unified architecture for jointly synthesising speech acoustics and skeleton-based 3D gesture motion from text, trained using optima…

    Submitted 9 January, 2024; v1 submitted 8 October, 2023; originally announced October 2023.

    Comments: 5 pages, 1 figure. Final version, accepted to IEEE ICASSP 2024

    MSC Class: 68T07 (Primary); 68T42 (Secondary) ACM Class: I.2.7; I.2.6; H.5

  4. arXiv:2309.03199

    eess.AS cs.HC cs.LG cs.SD

    Matcha-TTS: A fast TTS architecture with conditional flow matching

    Authors: Shivam Mehta, Ruibo Tu, Jonas Beskow, Éva Székely, Gustav Eje Henter

    Abstract: We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM). This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching. Careful design choices additionally ensure each synthesis step is fast to run. The method is probabilistic…

    Submitted 9 January, 2024; v1 submitted 6 September, 2023; originally announced September 2023.

    Comments: 5 pages, 3 figures. Final version, accepted to IEEE ICASSP 2024

    MSC Class: 68T07 ACM Class: I.2.7; I.2.6; H.5.5

  5. arXiv:2307.05132

    eess.AS cs.HC cs.LG cs.SD

    On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis

    Authors: Siyang Wang, Gustav Eje Henter, Joakim Gustafson, Éva Székely

    Abstract: Self-supervised learning (SSL) speech representations learned from large amounts of diverse, mixed-quality speech data without transcriptions are gaining ground in many speech technology applications. Prior work has shown that SSL is an effective intermediate representation in two-stage text-to-speech (TTS) for both read and spontaneous speech. However, it is still not clear which SSL and which la…

    Submitted 11 July, 2023; originally announced July 2023.

    Comments: 7 pages, 2 figures. 12th ISCA Speech Synthesis Workshop (SSW) 2023

  6. arXiv:2306.09417

    eess.AS cs.AI cs.CV cs.HC cs.LG

    Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis

    Authors: Shivam Mehta, Siyang Wang, Simon Alexanderson, Jonas Beskow, Éva Székely, Gustav Eje Henter

    Abstract: With read-aloud speech synthesis achieving high naturalness scores, there is a growing research interest in synthesising spontaneous speech. However, human spontaneous face-to-face conversation has both spoken and non-verbal aspects (here, co-speech gestures). Only recently has research begun to explore the benefits of jointly synthesising these two modalities in a single system. The previous stat…

    Submitted 9 August, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

    Comments: 7 pages, 2 figures, presented at the ISCA Speech Synthesis Workshop (SSW) 2023

    MSC Class: 68T07 (Primary); 68T42 (Secondary) ACM Class: I.2.7; I.2.6; G.3; H.5.5

  7. arXiv:2305.17971

    eess.AS cs.SD

    Automatic Evaluation of Turn-taking Cues in Conversational Speech Synthesis

    Authors: Erik Ekstedt, Siyang Wang, Éva Székely, Joakim Gustafson, Gabriel Skantze

    Abstract: Turn-taking is a fundamental aspect of human communication where speakers convey their intention to either hold, or yield, their turn through prosodic cues. Using the recently proposed Voice Activity Projection model, we propose an automatic evaluation approach to measure these aspects for conversational speech synthesis. We investigate the ability of three commercial, and two open-source, Text-To…

    Submitted 29 May, 2023; originally announced May 2023.

    Comments: Accepted at INTERSPEECH 2023, 5 pages, 2 figures, 4 tables

  8. arXiv:2303.02719

    eess.AS cs.HC cs.LG cs.SD

    A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS

    Authors: Siyang Wang, Gustav Eje Henter, Joakim Gustafson, Éva Székely

    Abstract: Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the representation medium in standard two-stage TTS, in place of conventionally used mel-spectrograms. It is however unclear which speech SSL is the better fit for TTS, and whether or not the performance differs between read and spontaneous TTS, the latter of which is arguably more challenging…

    Submitted 10 July, 2023; v1 submitted 5 March, 2023; originally announced March 2023.

    Comments: 5 pages, 2 figures. ICASSP Workshop SASB (Self-Supervision in Audio, Speech and Beyond) 2023

    MSC Class: 68T05 ACM Class: I.2.7; I.2.6; H.5.5

    Journal ref: Proceedings of the 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)

  9. arXiv:2211.13533

    eess.AS cs.HC cs.LG cs.SD

    Prosody-controllable spontaneous TTS with neural HMMs

    Authors: Harm Lameris, Shivam Mehta, Gustav Eje Henter, Joakim Gustafson, Éva Székely

    Abstract: Spontaneous speech has many affective and pragmatic functions that are interesting and challenging to model in TTS. However, the presence of reduced articulation, fillers, repetitions, and other disfluencies in spontaneous speech makes the text and acoustics less aligned than in read speech, which is problematic for attention-based TTS. We propose a TTS architecture that can rapidly learn to speak…

    Submitted 1 June, 2023; v1 submitted 24 November, 2022; originally announced November 2022.

    Comments: 5 pages, 3 figures, Published at ICASSP 2023

    MSC Class: 68T07 ACM Class: I.2.7; I.2.6; G.3; H.5.5

  10. arXiv:2211.06892

    eess.AS cs.HC cs.LG cs.SD

    OverFlow: Putting flows on top of neural transducers for better TTS

    Authors: Shivam Mehta, Ambika Kirkland, Harm Lameris, Jonas Beskow, Éva Székely, Gustav Eje Henter

    Abstract: Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech. They combine the best features of classic statistical speech synthesis and modern neural TTS, requiring less data and fewer training updates, and are less prone to gibberish output caused by neural attention failures. In this paper, we combine neural HMM TTS with normalising flows fo…

    Submitted 29 May, 2023; v1 submitted 13 November, 2022; originally announced November 2022.

    Comments: 5 pages, 2 figures. Accepted for publication at Interspeech 2023

    MSC Class: 68T07 ACM Class: I.2.7; I.2.6; G.3; H.5.5

  11. arXiv:2108.13320

    eess.AS cs.HC cs.LG cs.SD

    Neural HMMs are all you need (for high-quality attention-free TTS)

    Authors: Shivam Mehta, Éva Székely, Jonas Beskow, Gustav Eje Henter

    Abstract: Neural sequence-to-sequence TTS has achieved significantly better output quality than statistical speech synthesis using HMMs. However, neural TTS is generally not probabilistic and uses non-monotonic attention. Attention failures increase training time and can make synthesis babble incoherently. This paper describes how the old and new paradigms can be combined to obtain the advantages of both wo…

    Submitted 16 February, 2022; v1 submitted 30 August, 2021; originally announced August 2021.

    Comments: 5 pages, 2 figures; final version for ICASSP 2022

    MSC Class: 68T07 ACM Class: I.2.7; I.2.6; G.3; H.5.5

  12. arXiv:2108.11436

    cs.HC cs.GR cs.LG cs.SD eess.AS

    Integrated Speech and Gesture Synthesis

    Authors: Siyang Wang, Simon Alexanderson, Joakim Gustafson, Jonas Beskow, Gustav Eje Henter, Éva Székely

    Abstract: Text-to-speech and co-speech gesture synthesis have until now been treated as separate areas by two different research communities, and applications merely stack the two technologies using a simple system-level pipeline. This can lead to modeling inefficiencies and may introduce inconsistencies that limit the achievable naturalness. We propose to instead synthesize the two modalities in a single m…

    Submitted 25 August, 2021; originally announced August 2021.

    Comments: 9 pages, accepted at ICMI 2021

  13. arXiv:2101.05684

    cs.LG cs.GR cs.SD eess.AS

    Generating coherent spontaneous speech and gesture from text

    Authors: Simon Alexanderson, Éva Székely, Gustav Eje Henter, Taras Kucherenko, Jonas Beskow

    Abstract: Embodied human communication encompasses both verbal (speech) and non-verbal information (e.g., gesture and head movements). Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both of these types of data: On the speech side, text-to-speech systems are now able to generate highly convincing, spontaneous-sounding speech using unscrip…

    Submitted 14 January, 2021; originally announced January 2021.

    Comments: 3 pages, 2 figures, published at the ACM International Conference on Intelligent Virtual Agents (IVA) 2020

    MSC Class: 68T07 ACM Class: I.2.6; J.4; I.3.7; I.2.9

    Journal ref: Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents (IVA '20), 2020, 3 pages