Showing 1–29 of 29 results for author: Henter, G E

Searching in archive eess.
  1. arXiv:2406.05401  [pdf, other]

    eess.AS cs.HC cs.SD

    Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech

    Authors: Shivam Mehta, Harm Lameris, Rajiv Punmiya, Jonas Beskow, Éva Székely, Gustav Eje Henter

    Abstract: Converting input symbols to output audio in TTS requires modelling the durations of speech sounds. Leading non-autoregressive (NAR) TTS models treat duration modelling as a regression problem. The same utterance is then spoken with identical timings every time, unlike when a human speaks. Probabilistic models of duration have been proposed, but there is mixed evidence of their benefits. However, p…

    Submitted 8 June, 2024; originally announced June 2024.

    Comments: 5 pages, 2 figures. Final version, accepted to Interspeech 2024

    MSC Class: 68T07 ACM Class: I.2.7; I.2.6; H.5.5

  2. arXiv:2404.19622  [pdf, other]

    cs.HC cs.CV cs.GR cs.SD eess.AS

    Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis

    Authors: Shivam Mehta, Anna Deichler, Jim O'Regan, Birger Moëll, Jonas Beskow, Gustav Eje Henter, Simon Alexanderson

    Abstract: Although humans engaged in face-to-face conversation simultaneously communicate both verbally and non-verbally, methods for joint and unified synthesis of speech audio and co-speech 3D gesture motion from text are a new and emerging field. These technologies hold great promise for more human-like, efficient, expressive, and robust synthetic communication, but are currently held back by the lack of…

    Submitted 30 April, 2024; originally announced April 2024.

    Comments: 13+1 pages, 2 figures, accepted at the Human Motion Generation workshop (HuMoGen) at CVPR 2024

    MSC Class: 68T07 (Primary); 68T42 (Secondary) ACM Class: I.2.7; I.2.6; H.5

  3. arXiv:2310.05181  [pdf, other]

    eess.AS cs.GR cs.HC cs.LG cs.SD

    Unified speech and gesture synthesis using flow matching

    Authors: Shivam Mehta, Ruibo Tu, Simon Alexanderson, Jonas Beskow, Éva Székely, Gustav Eje Henter

    Abstract: As text-to-speech technologies achieve remarkable naturalness in read-aloud tasks, there is growing interest in multimodal synthesis of verbal and non-verbal communicative behaviour, such as spontaneous speech and associated body gestures. This paper presents a novel, unified architecture for jointly synthesising speech acoustics and skeleton-based 3D gesture motion from text, trained using optima…

    Submitted 9 January, 2024; v1 submitted 8 October, 2023; originally announced October 2023.

    Comments: 5 pages, 1 figure. Final version, accepted to IEEE ICASSP 2024

    MSC Class: 68T07 (Primary); 68T42 (Secondary) ACM Class: I.2.7; I.2.6; H.5

  4. arXiv:2309.03199  [pdf, other]

    eess.AS cs.HC cs.LG cs.SD

    Matcha-TTS: A fast TTS architecture with conditional flow matching

    Authors: Shivam Mehta, Ruibo Tu, Jonas Beskow, Éva Székely, Gustav Eje Henter

    Abstract: We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM). This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching. Careful design choices additionally ensure each synthesis step is fast to run. The method is probabilistic…

    Submitted 9 January, 2024; v1 submitted 6 September, 2023; originally announced September 2023.

    Comments: 5 pages, 3 figures. Final version, accepted to IEEE ICASSP 2024

    MSC Class: 68T07 ACM Class: I.2.7; I.2.6; H.5.5

  5. arXiv:2307.05132  [pdf, other]

    eess.AS cs.HC cs.LG cs.SD

    On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis

    Authors: Siyang Wang, Gustav Eje Henter, Joakim Gustafson, Éva Székely

    Abstract: Self-supervised learning (SSL) speech representations learned from large amounts of diverse, mixed-quality speech data without transcriptions are gaining ground in many speech technology applications. Prior work has shown that SSL is an effective intermediate representation in two-stage text-to-speech (TTS) for both read and spontaneous speech. However, it is still not clear which SSL and which la…

    Submitted 11 July, 2023; originally announced July 2023.

    Comments: 7 pages, 2 figures. 12th ISCA Speech Synthesis Workshop (SSW) 2023

  6. arXiv:2306.09417  [pdf, other]

    eess.AS cs.AI cs.CV cs.HC cs.LG

    Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis

    Authors: Shivam Mehta, Siyang Wang, Simon Alexanderson, Jonas Beskow, Éva Székely, Gustav Eje Henter

    Abstract: With read-aloud speech synthesis achieving high naturalness scores, there is a growing research interest in synthesising spontaneous speech. However, human spontaneous face-to-face conversation has both spoken and non-verbal aspects (here, co-speech gestures). Only recently has research begun to explore the benefits of jointly synthesising these two modalities in a single system. The previous stat…

    Submitted 9 August, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

    Comments: 7 pages, 2 figures, presented at the ISCA Speech Synthesis Workshop (SSW) 2023

    MSC Class: 68T07 (Primary); 68T42 (Secondary) ACM Class: I.2.7; I.2.6; G.3; H.5.5

  7. arXiv:2306.01957  [pdf, other]

    eess.AS

    Speaker-independent neural formant synthesis

    Authors: Pablo Pérez Zarazaga, Zofia Malisz, Gustav Eje Henter, Lauri Juvela

    Abstract: We describe speaker-independent speech synthesis driven by a small set of phonetically meaningful speech parameters such as formant frequencies. The intention is to leverage deep-learning advances to provide a highly realistic signal generator that includes control affordances required for stimulus creation in the speech sciences. Our approach turns input speech parameters into predicted mel-spect…

    Submitted 2 June, 2023; originally announced June 2023.

    Comments: 5 pages, 4 figures. Article accepted at INTERSPEECH 2023

  8. arXiv:2303.07442  [pdf, other]

    eess.AS cs.SD

    A processing framework to access large quantities of whispered speech found in ASMR

    Authors: Pablo Perez Zarazaga, Gustav Eje Henter, Zofia Malisz

    Abstract: Whispering is a ubiquitous mode of communication that humans use daily. Despite this, whispered speech has been poorly served by existing speech technology due to a shortage of resources and processing methodology. To remedy this, this paper provides a processing framework that enables access to large and unique data of high-quality whispered speech. We obtain the data from recordings submitted to…

    Submitted 13 March, 2023; originally announced March 2023.

    Comments: Accepted at ICASSP 2023, 5 pages, 2 figures, 2 tables

  9. arXiv:2303.02719  [pdf, other]

    eess.AS cs.HC cs.LG cs.SD

    A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS

    Authors: Siyang Wang, Gustav Eje Henter, Joakim Gustafson, Éva Székely

    Abstract: Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the representation medium in standard two-stage TTS, in place of conventionally used mel-spectrograms. It is however unclear which speech SSL is the better fit for TTS, and whether or not the performance differs between read and spontaneous TTS, the latter of which is arguably more challenging…

    Submitted 10 July, 2023; v1 submitted 5 March, 2023; originally announced March 2023.

    Comments: 5 pages, 2 figures. ICASSP Workshop SASB (Self-Supervision in Audio, Speech and Beyond) 2023

    MSC Class: 68T05 ACM Class: I.2.7; I.2.6; H.5.5

    Journal ref: Proceedings of the 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)

  10. arXiv:2211.13533  [pdf, other]

    eess.AS cs.HC cs.LG cs.SD

    Prosody-controllable spontaneous TTS with neural HMMs

    Authors: Harm Lameris, Shivam Mehta, Gustav Eje Henter, Joakim Gustafson, Éva Székely

    Abstract: Spontaneous speech has many affective and pragmatic functions that are interesting and challenging to model in TTS. However, the presence of reduced articulation, fillers, repetitions, and other disfluencies in spontaneous speech makes the text and acoustics less aligned than in read speech, which is problematic for attention-based TTS. We propose a TTS architecture that can rapidly learn to speak…

    Submitted 1 June, 2023; v1 submitted 24 November, 2022; originally announced November 2022.

    Comments: 5 pages, 3 figures, Published at ICASSP 2023

    MSC Class: 68T07 ACM Class: I.2.7; I.2.6; G.3; H.5.5

  11. arXiv:2211.09707  [pdf, other]

    cs.LG cs.CV cs.GR cs.HC cs.SD eess.AS

    Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models

    Authors: Simon Alexanderson, Rajmund Nagy, Jonas Beskow, Gustav Eje Henter

    Abstract: Diffusion models have experienced a surge of interest as highly expressive yet efficiently trainable probabilistic models. We show that these models are an excellent fit for synthesising human motion that co-occurs with audio, e.g., dancing and co-speech gesticulation, since motion is complex and highly ambiguous given audio, calling for a probabilistic description. Specifically, we adapt the Diff…

    Submitted 16 May, 2023; v1 submitted 17 November, 2022; originally announced November 2022.

    Comments: 20 pages, 9 figures. Published in ACM ToG and presented at SIGGRAPH 2023

    MSC Class: 68T07 ACM Class: G.3; I.2.6; I.3.7; J.5

    Journal ref: ACM Trans. Graph. 42, 4 (August 2023), 20 pages

  12. Autovocoder: Fast Waveform Generation from a Learned Speech Representation using Differentiable Digital Signal Processing

    Authors: Jacob J Webber, Cassia Valentini-Botinhao, Evelyn Williams, Gustav Eje Henter, Simon King

    Abstract: Most state-of-the-art Text-to-Speech systems use the mel-spectrogram as an intermediate representation, to decompose the task into acoustic modelling and waveform generation. A mel-spectrogram is extracted from the waveform by a simple, fast DSP operation, but generating a high-quality waveform from a mel-spectrogram requires computationally expensive machine learning: a neural vocoder. Our prop…

    Submitted 24 May, 2023; v1 submitted 13 November, 2022; originally announced November 2022.

    Comments: Accepted to the 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2023)

    Journal ref: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023, pp. 1-5

  13. arXiv:2211.06892  [pdf, other]

    eess.AS cs.HC cs.LG cs.SD

    OverFlow: Putting flows on top of neural transducers for better TTS

    Authors: Shivam Mehta, Ambika Kirkland, Harm Lameris, Jonas Beskow, Éva Székely, Gustav Eje Henter

    Abstract: Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech. They combine the best features of classic statistical speech synthesis and modern neural TTS, requiring less data and fewer training updates, and are less prone to gibberish output caused by neural attention failures. In this paper, we combine neural HMM TTS with normalising flows fo…

    Submitted 29 May, 2023; v1 submitted 13 November, 2022; originally announced November 2022.

    Comments: 5 pages, 2 figures. Accepted for publication at Interspeech 2023

    MSC Class: 68T07 ACM Class: I.2.7; I.2.6; G.3; H.5.5

  14. Predicting pairwise preferences between TTS audio stimuli using parallel ratings data and anti-symmetric twin neural networks

    Authors: Cassia Valentini-Botinhao, Manuel Sam Ribeiro, Oliver Watts, Korin Richmond, Gustav Eje Henter

    Abstract: Automatically predicting the outcome of subjective listening tests is a challenging task. Ratings may vary from person to person even if preferences are consistent across listeners. While previous work has focused on predicting listeners' ratings (mean opinion scores) of individual stimuli, we focus on the simpler task of predicting subjective preference given two speech stimuli for the same text.…

    Submitted 22 September, 2022; originally announced September 2022.

    Journal ref: Proceedings of INTERSPEECH 2022

  15. arXiv:2208.10441  [pdf, other]

    cs.HC cs.GR cs.LG cs.MM cs.SD eess.AS

    The GENEA Challenge 2022: A large evaluation of data-driven co-speech gesture generation

    Authors: Youngwoo Yoon, Pieter Wolfert, Taras Kucherenko, Carla Viegas, Teodor Nikolov, Mihail Tsakov, Gustav Eje Henter

    Abstract: This paper reports on the second GENEA Challenge to benchmark data-driven automatic co-speech gesture generation. Participating teams used the same speech and motion dataset to build gesture-generation systems. Motion generated by all these systems was rendered to video using a standardised visualisation pipeline and evaluated in several large, crowdsourced user studies. Unlike when comparing diff…

    Submitted 22 August, 2022; originally announced August 2022.

    Comments: 12 pages, 5 figures; final version for ACM ICMI 2022

    ACM Class: I.3; I.2

  16. arXiv:2202.10973  [pdf, other]

    eess.AS cs.HC cs.LG cs.SD

    Wavebender GAN: An architecture for phonetically meaningful speech manipulation

    Authors: Gustavo Teodoro Döhler Beck, Ulme Wennberg, Zofia Malisz, Gustav Eje Henter

    Abstract: Deep learning has revolutionised synthetic speech quality. However, it has thus far delivered little value to the speech science community. The new methods do not meet the controllability demands that practitioners in this area require, e.g., in listening tests with manipulated speech stimuli. Instead, control of different speech properties in such stimuli is achieved by using legacy signal-process…

    Submitted 22 February, 2022; originally announced February 2022.

    Comments: 5 pages, 4 figures; to appear at ICASSP 2022

    MSC Class: 68T07 ACM Class: I.2.7; I.2.6; J.5; H.5.5

  17. arXiv:2108.13320  [pdf, other]

    eess.AS cs.HC cs.LG cs.SD

    Neural HMMs are all you need (for high-quality attention-free TTS)

    Authors: Shivam Mehta, Éva Székely, Jonas Beskow, Gustav Eje Henter

    Abstract: Neural sequence-to-sequence TTS has achieved significantly better output quality than statistical speech synthesis using HMMs. However, neural TTS is generally not probabilistic and uses non-monotonic attention. Attention failures increase training time and can make synthesis babble incoherently. This paper describes how the old and new paradigms can be combined to obtain the advantages of both wo…

    Submitted 16 February, 2022; v1 submitted 30 August, 2021; originally announced August 2021.

    Comments: 5 pages, 2 figures; final version for ICASSP 2022

    MSC Class: 68T07 ACM Class: I.2.7; I.2.6; G.3; H.5.5

  18. arXiv:2108.11436  [pdf, other]

    cs.HC cs.GR cs.LG cs.SD eess.AS

    Integrated Speech and Gesture Synthesis

    Authors: Siyang Wang, Simon Alexanderson, Joakim Gustafson, Jonas Beskow, Gustav Eje Henter, Éva Székely

    Abstract: Text-to-speech and co-speech gesture synthesis have until now been treated as separate areas by two different research communities, and applications merely stack the two technologies using a simple system-level pipeline. This can lead to modeling inefficiencies and may introduce inconsistencies that limit the achievable naturalness. We propose to instead synthesize the two modalities in a single m…

    Submitted 25 August, 2021; originally announced August 2021.

    Comments: 9 pages, accepted at ICMI 2021

  19. arXiv:2107.00730  [pdf, other]

    cs.LG cs.CL cs.SD eess.AS

    Normalizing Flow based Hidden Markov Models for Classification of Speech Phones with Explainability

    Authors: Anubhab Ghosh, Antoine Honoré, Dong Liu, Gustav Eje Henter, Saikat Chatterjee

    Abstract: In pursuit of explainability, we develop generative models for sequential data. The proposed models provide state-of-the-art classification results and robust performance for speech phone classification. We combine modern neural networks (normalizing flows) and traditional generative models (hidden Markov models - HMMs). Normalizing flow-based mixture models (NMMs) are used to model the conditiona…

    Submitted 1 July, 2021; originally announced July 2021.

    Comments: 12 pages, 4 figures

  20. arXiv:2106.13871  [pdf, other]

    cs.SD cs.GR cs.LG eess.AS

    Transflower: probabilistic autoregressive dance generation with multimodal attention

    Authors: Guillermo Valle-Pérez, Gustav Eje Henter, Jonas Beskow, André Holzapfel, Pierre-Yves Oudeyer, Simon Alexanderson

    Abstract: Dance requires skillful composition of complex movements that follow rhythmic, tonal and timbral features of music. Formally, generating dance conditioned on a piece of music can be expressed as a problem of modelling a high-dimensional continuous motion signal, conditioned on an audio signal. In this work we make two contributions to tackle this problem. First, we present a novel probabilistic au…

    Submitted 11 June, 2022; v1 submitted 25 June, 2021; originally announced June 2021.

    Comments: Article presented at SIGGRAPH Asia 2021, and published in ACM Transactions on Graphics

  21. Robust Classification using Hidden Markov Models and Mixtures of Normalizing Flows

    Authors: Anubhab Ghosh, Antoine Honoré, Dong Liu, Gustav Eje Henter, Saikat Chatterjee

    Abstract: We test the robustness of a maximum-likelihood (ML) based classifier where sequential data as observation is corrupted by noise. The hypothesis is that a generative model, that combines the state transitions of a hidden Markov model (HMM) and the neural network based probability distributions for the hidden states of the HMM, can provide a robust classification performance. The combined model is c…

    Submitted 14 February, 2021; originally announced February 2021.

    Comments: 6 pages. Accepted at MLSP 2020

  22. arXiv:2101.05684  [pdf, other]

    cs.LG cs.GR cs.SD eess.AS

    Generating coherent spontaneous speech and gesture from text

    Authors: Simon Alexanderson, Éva Székely, Gustav Eje Henter, Taras Kucherenko, Jonas Beskow

    Abstract: Embodied human communication encompasses both verbal (speech) and non-verbal information (e.g., gesture and head movements). Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both of these types of data: On the speech side, text-to-speech systems are now able to generate highly convincing, spontaneous-sounding speech using unscrip…

    Submitted 14 January, 2021; originally announced January 2021.

    Comments: 3 pages, 2 figures, published at the ACM International Conference on Intelligent Virtual Agents (IVA) 2020

    MSC Class: 68T07 ACM Class: I.2.6; J.4; I.3.7; I.2.9

    Journal ref: Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents (IVA '20), 2020, 3 pages

  23. arXiv:2006.09888  [pdf, other]

    cs.CV cs.HC cs.LG cs.SD eess.AS eess.IV stat.ML

    Let's Face It: Probabilistic Multi-modal Interlocutor-aware Generation of Facial Gestures in Dyadic Settings

    Authors: Patrik Jonell, Taras Kucherenko, Gustav Eje Henter, Jonas Beskow

    Abstract: To enable more natural face-to-face interactions, conversational agents need to adapt their behavior to their interlocutors. One key aspect of this is generation of appropriate non-verbal behavior for the agent, for example facial gestures, here defined as facial expressions and head movements. Most existing gesture-generating systems do not utilize multi-modal cues from the interlocutor when synt…

    Submitted 22 October, 2020; v1 submitted 11 June, 2020; originally announced June 2020.

    Comments: Best Paper Award. 8 pages, 4 figures, IVA '20: Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents

  24. arXiv:2001.09326  [pdf, other]

    cs.HC cs.LG eess.AS

    Gesticulator: A framework for semantically-aware speech-driven gesture generation

    Authors: Taras Kucherenko, Patrik Jonell, Sanne van Waveren, Gustav Eje Henter, Simon Alexanderson, Iolanda Leite, Hedvig Kjellström

    Abstract: During speech, people spontaneously gesticulate, which plays a key role in conveying information. Similarly, realistic co-speech gestures are crucial to enable natural and smooth interactions with social agents. Current end-to-end co-speech gesture generation systems use a single modality for representing speech: either audio or text. These systems are therefore confined to producing either acoust…

    Submitted 14 January, 2021; v1 submitted 25 January, 2020; originally announced January 2020.

    Comments: ICMI 2020 Best Paper Award. Code is available. 9 pages, 6 figures

    ACM Class: I.2.7; I.2.6; I.3.7

    Journal ref: Proceedings of the 2020 International Conference on Multimodal Interaction (ICMI '20)

  25. arXiv:1911.03952  [pdf, other]

    cs.SD eess.AS

    Transformation of low-quality device-recorded speech to high-quality speech using improved SEGAN model

    Authors: Seyyed Saeed Sarfjoo, Xin Wang, Gustav Eje Henter, Jaime Lorenzo-Trueba, Shinji Takaki, Junichi Yamagishi

    Abstract: Nowadays vast amounts of speech data are recorded from low-quality recorder devices such as smartphones, tablets, laptops, and medium-quality microphones. The objective of this research was to study the automatic generation of high-quality speech from such low-quality device-recorded speech, which could then be applied to many speech-generation tasks. In this paper, we first introduce our new devi…

    Submitted 20 November, 2019; v1 submitted 10 November, 2019; originally announced November 2019.

    Comments: This study was conducted during an internship of the first author at NII, Japan in 2017

  26. arXiv:1905.06598  [pdf, other]

    cs.LG cs.GR eess.IV stat.ML

    MoGlow: Probabilistic and controllable motion synthesis using normalising flows

    Authors: Gustav Eje Henter, Simon Alexanderson, Jonas Beskow

    Abstract: Data-driven modelling and synthesis of motion is an active research area with applications that include animation, games, and social robotics. This paper introduces a new class of probabilistic, generative, and controllable motion-data models based on normalising flows. Models of this kind can describe highly complex distributions, yet can be trained efficiently using exact maximum likelihood, unl…

    Submitted 7 December, 2020; v1 submitted 16 May, 2019; originally announced May 2019.

    Comments: 14 pages, 5 figures, published in ACM Transactions on Graphics and presented at SIGGRAPH Asia 2020

    ACM Class: I.3.7; G.3; I.2.6

    Journal ref: ACM Trans. Graph. 39, 4, Article 236 (November 2020), 14 pages

  27. arXiv:1807.11470  [pdf, other]

    eess.AS cs.LG cs.SD stat.ML

    Deep Encoder-Decoder Models for Unsupervised Learning of Controllable Speech Synthesis

    Authors: Gustav Eje Henter, Jaime Lorenzo-Trueba, Xin Wang, Junichi Yamagishi

    Abstract: Generating versatile and appropriate synthetic speech requires control over the output expression separate from the spoken text. Important non-textual speech variation is seldom annotated, in which case output control must be learned in an unsupervised fashion. In this paper, we perform an in-depth study of methods for unsupervised learning of control in statistical speech synthesis. For example,…

    Submitted 9 September, 2018; v1 submitted 30 July, 2018; originally announced July 2018.

    Comments: 17 pages, 4 figures

    MSC Class: 62F99 ACM Class: I.2.7; G.3

  28. arXiv:1807.11320  [pdf, other]

    cs.LG eess.SP stat.ML

    Kernel Density Estimation-Based Markov Models with Hidden State

    Authors: Gustav Eje Henter, Arne Leijon, W. Bastiaan Kleijn

    Abstract: We consider Markov models of stochastic processes where the next-step conditional distribution is defined by a kernel density estimator (KDE), similar to Markov forecast densities and certain time-series bootstrap schemes. The KDE Markov models (KDE-MMs) we discuss are nonlinear, nonparametric, fully probabilistic representations of stationary processes, based on techniques with strong asymptotic…

    Submitted 30 July, 2018; originally announced July 2018.

    Comments: 14 pages, 6 figures

    MSC Class: 62M10; 62G07 ACM Class: G.3

  29. arXiv:1807.10941  [pdf, other]

    eess.AS cs.SD

    Analysing Shortcomings of Statistical Parametric Speech Synthesis

    Authors: Gustav Eje Henter, Simon King, Thomas Merritt, Gilles Degottex

    Abstract: Output from statistical parametric speech synthesis (SPSS) remains noticeably worse than natural speech recordings in terms of quality, naturalness, speaker similarity, and intelligibility in noise. There are many hypotheses regarding the origins of these shortcomings, but these hypotheses are often kept vague and presented without empirical evidence that could confirm and quantify how a specific…

    Submitted 28 July, 2018; originally announced July 2018.

    Comments: 34 pages with 4 figures; draft book chapter

    ACM Class: I.2.7; H.5.5