Showing 1–21 of 21 results for author: Beskow, J

  1. arXiv:2406.05401

    eess.AS cs.HC cs.SD

    Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech

    Authors: Shivam Mehta, Harm Lameris, Rajiv Punmiya, Jonas Beskow, Éva Székely, Gustav Eje Henter

    Abstract: Converting input symbols to output audio in TTS requires modelling the durations of speech sounds. Leading non-autoregressive (NAR) TTS models treat duration modelling as a regression problem. The same utterance is then spoken with identical timings every time, unlike when a human speaks. Probabilistic models of duration have been proposed, but there is mixed evidence of their benefits. However, p…

    Submitted 8 June, 2024; originally announced June 2024.

    Comments: 5 pages, 2 figures. Final version, accepted to Interspeech 2024

    MSC Class: 68T07 ACM Class: I.2.7; I.2.6; H.5.5

  2. arXiv:2404.19622

    cs.HC cs.CV cs.GR cs.SD eess.AS

    Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis

    Authors: Shivam Mehta, Anna Deichler, Jim O'Regan, Birger Moëll, Jonas Beskow, Gustav Eje Henter, Simon Alexanderson

    Abstract: Although humans engaged in face-to-face conversation simultaneously communicate both verbally and non-verbally, methods for joint and unified synthesis of speech audio and co-speech 3D gesture motion from text are a new and emerging field. These technologies hold great promise for more human-like, efficient, expressive, and robust synthetic communication, but are currently held back by the lack of…

    Submitted 30 April, 2024; originally announced April 2024.

    Comments: 13+1 pages, 2 figures, accepted at the Human Motion Generation workshop (HuMoGen) at CVPR 2024

    MSC Class: 68T07 (Primary); 68T42 (Secondary) ACM Class: I.2.7; I.2.6; H.5

  3. arXiv:2310.05181

    eess.AS cs.GR cs.HC cs.LG cs.SD

    Unified speech and gesture synthesis using flow matching

    Authors: Shivam Mehta, Ruibo Tu, Simon Alexanderson, Jonas Beskow, Éva Székely, Gustav Eje Henter

    Abstract: As text-to-speech technologies achieve remarkable naturalness in read-aloud tasks, there is growing interest in multimodal synthesis of verbal and non-verbal communicative behaviour, such as spontaneous speech and associated body gestures. This paper presents a novel, unified architecture for jointly synthesising speech acoustics and skeleton-based 3D gesture motion from text, trained using optima…

    Submitted 9 January, 2024; v1 submitted 8 October, 2023; originally announced October 2023.

    Comments: 5 pages, 1 figure. Final version, accepted to IEEE ICASSP 2024

    MSC Class: 68T07 (Primary); 68T42 (Secondary) ACM Class: I.2.7; I.2.6; H.5

  4. arXiv:2309.05455

    eess.AS cs.HC cs.LG cs.SD

    Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation

    Authors: Anna Deichler, Shivam Mehta, Simon Alexanderson, Jonas Beskow

    Abstract: This paper describes a system developed for the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023. Our solution builds on an existing diffusion-based motion synthesis model. We propose a contrastive speech and motion pretraining (CSMP) module, which learns a joint embedding for speech and gesture with the aim to learn a semantic coupling between these mod…

    Submitted 11 September, 2023; originally announced September 2023.

    MSC Class: 68T42 ACM Class: I.2.6; I.2.7

  5. arXiv:2309.03199

    eess.AS cs.HC cs.LG cs.SD

    Matcha-TTS: A fast TTS architecture with conditional flow matching

    Authors: Shivam Mehta, Ruibo Tu, Jonas Beskow, Éva Székely, Gustav Eje Henter

    Abstract: We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM). This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching. Careful design choices additionally ensure each synthesis step is fast to run. The method is probabilistic…

    Submitted 9 January, 2024; v1 submitted 6 September, 2023; originally announced September 2023.

    Comments: 5 pages, 3 figures. Final version, accepted to IEEE ICASSP 2024

    MSC Class: 68T07 ACM Class: I.2.7; I.2.6; H.5.5

  6. arXiv:2306.09417

    eess.AS cs.AI cs.CV cs.HC cs.LG

    Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis

    Authors: Shivam Mehta, Siyang Wang, Simon Alexanderson, Jonas Beskow, Éva Székely, Gustav Eje Henter

    Abstract: With read-aloud speech synthesis achieving high naturalness scores, there is a growing research interest in synthesising spontaneous speech. However, human spontaneous face-to-face conversation has both spoken and non-verbal aspects (here, co-speech gestures). Only recently has research begun to explore the benefits of jointly synthesising these two modalities in a single system. The previous stat…

    Submitted 9 August, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

    Comments: 7 pages, 2 figures, presented at the ISCA Speech Synthesis Workshop (SSW) 2023

    MSC Class: 68T07 (Primary); 68T42 (Secondary) ACM Class: I.2.7; I.2.6; G.3; H.5.5

  7. arXiv:2211.09707

    cs.LG cs.CV cs.GR cs.HC cs.SD eess.AS

    Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models

    Authors: Simon Alexanderson, Rajmund Nagy, Jonas Beskow, Gustav Eje Henter

    Abstract: Diffusion models have experienced a surge of interest as highly expressive yet efficiently trainable probabilistic models. We show that these models are an excellent fit for synthesising human motion that co-occurs with audio, e.g., dancing and co-speech gesticulation, since motion is complex and highly ambiguous given audio, calling for a probabilistic description. Specifically, we adapt the Diff…

    Submitted 16 May, 2023; v1 submitted 17 November, 2022; originally announced November 2022.

    Comments: 20 pages, 9 figures. Published in ACM ToG and presented at SIGGRAPH 2023

    MSC Class: 68T07 ACM Class: G.3; I.2.6; I.3.7; J.5

    Journal ref: ACM Trans. Graph. 42, 4 (August 2023), 20 pages

  8. arXiv:2211.06892

    eess.AS cs.HC cs.LG cs.SD

    OverFlow: Putting flows on top of neural transducers for better TTS

    Authors: Shivam Mehta, Ambika Kirkland, Harm Lameris, Jonas Beskow, Éva Székely, Gustav Eje Henter

    Abstract: Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech. They combine the best features of classic statistical speech synthesis and modern neural TTS, requiring less data and fewer training updates, and are less prone to gibberish output caused by neural attention failures. In this paper, we combine neural HMM TTS with normalising flows fo…

    Submitted 29 May, 2023; v1 submitted 13 November, 2022; originally announced November 2022.

    Comments: 5 pages, 2 figures. Accepted for publication at Interspeech 2023

    MSC Class: 68T07 ACM Class: I.2.7; I.2.6; G.3; H.5.5

  9. arXiv:2109.01206

    cs.RO cs.HC

    Mechanical Chameleons: Evaluating the effects of a social robot's non-verbal behavior on social influence

    Authors: Patrik Jonell, Anna Deichler, Ilaria Torre, Iolanda Leite, Jonas Beskow

    Abstract: In this paper we present a pilot study which investigates how non-verbal behavior affects social influence in social robots. We also present a modular system which is capable of controlling the non-verbal behavior based on the interlocutor's facial gestures (head movements and facial expressions) in real time, and a study investigating whether three different strategies for facial gestures ("still…

    Submitted 2 September, 2021; originally announced September 2021.

    Comments: In proceedings of SCRITA 2021 (arXiv:2108.08092), a workshop at IEEE RO-MAN 2021: https://ro-man2021.org/

    Report number: SCRITA/2021/07

  10. arXiv:2108.13320

    eess.AS cs.HC cs.LG cs.SD

    Neural HMMs are all you need (for high-quality attention-free TTS)

    Authors: Shivam Mehta, Éva Székely, Jonas Beskow, Gustav Eje Henter

    Abstract: Neural sequence-to-sequence TTS has achieved significantly better output quality than statistical speech synthesis using HMMs. However, neural TTS is generally not probabilistic and uses non-monotonic attention. Attention failures increase training time and can make synthesis babble incoherently. This paper describes how the old and new paradigms can be combined to obtain the advantages of both wo…

    Submitted 16 February, 2022; v1 submitted 30 August, 2021; originally announced August 2021.

    Comments: 5 pages, 2 figures; final version for ICASSP 2022

    MSC Class: 68T07 ACM Class: I.2.7; I.2.6; G.3; H.5.5

  11. arXiv:2108.11436

    cs.HC cs.GR cs.LG cs.SD eess.AS

    Integrated Speech and Gesture Synthesis

    Authors: Siyang Wang, Simon Alexanderson, Joakim Gustafson, Jonas Beskow, Gustav Eje Henter, Éva Székely

    Abstract: Text-to-speech and co-speech gesture synthesis have until now been treated as separate areas by two different research communities, and applications merely stack the two technologies using a simple system-level pipeline. This can lead to modeling inefficiencies and may introduce inconsistencies that limit the achievable naturalness. We propose to instead synthesize the two modalities in a single m…

    Submitted 25 August, 2021; originally announced August 2021.

    Comments: 9 pages, accepted at ICMI 2021

  12. arXiv:2106.13871

    cs.SD cs.GR cs.LG eess.AS

    Transflower: probabilistic autoregressive dance generation with multimodal attention

    Authors: Guillermo Valle-Pérez, Gustav Eje Henter, Jonas Beskow, André Holzapfel, Pierre-Yves Oudeyer, Simon Alexanderson

    Abstract: Dance requires skillful composition of complex movements that follow rhythmic, tonal and timbral features of music. Formally, generating dance conditioned on a piece of music can be expressed as a problem of modelling a high-dimensional continuous motion signal, conditioned on an audio signal. In this work we make two contributions to tackle this problem. First, we present a novel probabilistic au…

    Submitted 11 June, 2022; v1 submitted 25 June, 2021; originally announced June 2021.

    Comments: Article presented at SIGGRAPH Asia 2021, and published in ACM Transactions on Graphics

  13. arXiv:2101.05684

    cs.LG cs.GR cs.SD eess.AS

    Generating coherent spontaneous speech and gesture from text

    Authors: Simon Alexanderson, Éva Székely, Gustav Eje Henter, Taras Kucherenko, Jonas Beskow

    Abstract: Embodied human communication encompasses both verbal (speech) and non-verbal information (e.g., gesture and head movements). Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both of these types of data: On the speech side, text-to-speech systems are now able to generate highly convincing, spontaneous-sounding speech using unscrip…

    Submitted 14 January, 2021; originally announced January 2021.

    Comments: 3 pages, 2 figures, published at the ACM International Conference on Intelligent Virtual Agents (IVA) 2020

    MSC Class: 68T07 ACM Class: I.2.6; J.4; I.3.7; I.2.9

    Journal ref: Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents (IVA '20), 2020, 3 pages

  14. Can we trust online crowdworkers? Comparing online and offline participants in a preference test of virtual agents

    Authors: Patrik Jonell, Taras Kucherenko, Ilaria Torre, Jonas Beskow

    Abstract: Conducting user studies is a crucial component in many scientific fields. While some studies require participants to be physically present, other studies can be conducted both physically (e.g. in-lab) and online (e.g. via crowdsourcing). Inviting participants to the lab can be a time-consuming and logistically difficult endeavor, not to mention that sometimes research groups might not be able to r…

    Submitted 23 October, 2020; v1 submitted 22 September, 2020; originally announced September 2020.

    Comments: Patrik Jonell and Taras Kucherenko contributed equally to this work. Published in the Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents. 8 pages, 7 figures

  15. arXiv:2006.09888

    cs.CV cs.HC cs.LG cs.SD eess.AS eess.IV stat.ML

    Let's Face It: Probabilistic Multi-modal Interlocutor-aware Generation of Facial Gestures in Dyadic Settings

    Authors: Patrik Jonell, Taras Kucherenko, Gustav Eje Henter, Jonas Beskow

    Abstract: To enable more natural face-to-face interactions, conversational agents need to adapt their behavior to their interlocutors. One key aspect of this is generation of appropriate non-verbal behavior for the agent, for example facial gestures, here defined as facial expressions and head movements. Most existing gesture-generating systems do not utilize multi-modal cues from the interlocutor when synt…

    Submitted 22 October, 2020; v1 submitted 11 June, 2020; originally announced June 2020.

    Comments: Best Paper Award. 8 pages, 4 figures, IVA '20: Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents

  16. arXiv:1905.06598

    cs.LG cs.GR eess.IV stat.ML

    MoGlow: Probabilistic and controllable motion synthesis using normalising flows

    Authors: Gustav Eje Henter, Simon Alexanderson, Jonas Beskow

    Abstract: Data-driven modelling and synthesis of motion is an active research area with applications that include animation, games, and social robotics. This paper introduces a new class of probabilistic, generative, and controllable motion-data models based on normalising flows. Models of this kind can describe highly complex distributions, yet can be trained efficiently using exact maximum likelihood, unl…

    Submitted 7 December, 2020; v1 submitted 16 May, 2019; originally announced May 2019.

    Comments: 14 pages, 5 figures, published in ACM Transactions on Graphics and presented at SIGGRAPH Asia 2020

    ACM Class: I.3.7; G.3; I.2.6

    Journal ref: ACM Trans. Graph. 39, 4, Article 236 (November 2020), 14 pages

  17. arXiv:1901.10461

    cs.HC cs.RO

    The effect of a physical robot on vocabulary learning

    Authors: Andreas Wedenborn, Preben Wik, Olov Engwall, Jonas Beskow

    Abstract: This study investigates the effect of a physical robot taking the role of a teacher or exercise partner in a language learning exercise. In order to investigate this, an application was developed enabling a second-language vocabulary learning exercise in three different conditions. In the first condition the learner would receive tutoring from a disembodied voice, in the second condition the tutor w…

    Submitted 30 January, 2019; originally announced January 2019.

    Comments: This study was presented at the International Workshop on Spoken Dialogue Systems (IWSDS 2016), Saariselkä, Finland, January 2016

  18. arXiv:1803.02665

    cs.LG

    A Neural Network Approach to Missing Marker Reconstruction in Human Motion Capture

    Authors: Taras Kucherenko, Jonas Beskow, Hedvig Kjellström

    Abstract: Optical motion capture systems have become a widely used technology in various fields, such as augmented reality, robotics, movie production, etc. Such systems use a large number of cameras to triangulate the position of optical markers. The marker positions are estimated with high accuracy. However, especially when tracking articulated bodies, a fraction of the markers in each timestep is missing…

    Submitted 25 September, 2018; v1 submitted 7 March, 2018; originally announced March 2018.

    Comments: 7 pages, 6 figures

    MSC Class: 68T05

  19. arXiv:1711.08992

    cs.CV cs.CL cs.HC cs.LG stat.ML

    Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition

    Authors: Kalin Stefanov, Jonas Beskow, Giampiero Salvi

    Abstract: This paper presents a self-supervised method for visual detection of the active speaker in a multi-person spoken interaction scenario. Active speaker detection is a fundamental prerequisite for any artificial cognitive system attempting to acquire language in social settings. The proposed method is intended to complement the acoustic detection of the active speaker, thus improving the system robus…

    Submitted 18 July, 2019; v1 submitted 24 November, 2017; originally announced November 2017.

    Comments: 10 pages, IEEE Transactions on Cognitive and Developmental Systems

    ACM Class: I.2; I.4; I.5

  20. arXiv:1709.01613

    cs.HC cs.AI cs.CY

    Machine Learning and Social Robotics for Detecting Early Signs of Dementia

    Authors: Patrik Jonell, Joseph Mendelson, Thomas Storskog, Goran Hagman, Per Ostberg, Iolanda Leite, Taras Kucherenko, Olga Mikheeva, Ulrika Akenine, Vesna Jelic, Alina Solomon, Jonas Beskow, Joakim Gustafson, Miia Kivipelto, Hedvig Kjellstrom

    Abstract: This paper presents the EACare project, an ambitious multi-disciplinary collaboration with the aim to develop an embodied system, capable of carrying out neuropsychological tests to detect early signs of dementia, e.g., due to Alzheimer's disease. The system will use methods from Machine Learning and Social Robotics, and be trained with examples of recorded clinician-patient interactions. The inte…

    Submitted 5 September, 2017; originally announced September 2017.

  21. arXiv:1211.3901

    cs.CV

    Visual Recognition of Isolated Swedish Sign Language Signs

    Authors: Saad Akram, Jonas Beskow, Hedvig Kjellstrom

    Abstract: We present a method for recognition of isolated Swedish Sign Language signs. The method will be used in a game intended to help children practise signing at home, as a complement to training with a teacher. The target group is not primarily deaf children, but children with language disorders. Using sign language as a support in conversation has been shown to greatly stimulate the speech developmen…

    Submitted 16 November, 2012; originally announced November 2012.