Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech

Authors: Shivam Mehta, Harm Lameris, Rajiv Punmiya, Jonas Beskow, Éva Székely, Gustav Eje Henter

Abstract: Converting input symbols to output audio in TTS requires modelling the durations of speech sounds. Leading non-autoregressive (NAR) TTS models treat duration modelling as a regression problem. The same utterance is then spoken with identical timings every time, unlike when a human speaks. Probabilistic models of duration have been proposed, but there is mixed evidence of their benefits. However, prior studies generally only consider speech read aloud, and ignore spontaneous speech, despite the latter being both a more common and a more variable mode of speaking. We compare the effect of conventional deterministic duration modelling to durations sampled from a powerful probabilistic model based on conditional flow matching (OT-CFM), in three different NAR TTS approaches: regression-based, deep generative, and end-to-end. Across four different corpora, stochastic duration modelling improves probabilistic NAR TTS approaches, especially for spontaneous speech. Please see https://shivammehta25.github.io/prob_dur/ for audio and resources.

Submitted 8 June, 2024; originally announced June 2024.

Comments: 5 pages, 2 figures. Final version, accepted to Interspeech 2024

MSC Class: 68T07 ACM Class: I.2.7; I.2.6; H.5.5

arXiv:2404.19622 [pdf, other]

Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis

Authors: Shivam Mehta, Anna Deichler, Jim O'Regan, Birger Moëll, Jonas Beskow, Gustav Eje Henter, Simon Alexanderson

Abstract: Although humans engaged in face-to-face conversation simultaneously communicate both verbally and non-verbally, methods for joint and unified synthesis of speech audio and co-speech 3D gesture motion from text are a new and emerging field. These technologies hold great promise for more human-like, efficient, expressive, and robust synthetic communication, but are currently held back by the lack of suitably large datasets, as existing methods are trained on parallel data from all constituent modalities. Inspired by student-teacher methods, we propose a straightforward solution to the data shortage, by simply synthesising additional training material. Specifically, we use unimodal synthesis models trained on large datasets to create multimodal (but synthetic) parallel training data, and then pre-train a joint synthesis model on that material. In addition, we propose a new synthesis architecture that adds better and more controllable prosody modelling to the state-of-the-art method in the field. Our results confirm that pre-training on large amounts of synthetic data improves the quality of both the speech and the motion synthesised by the multimodal model, with the proposed architecture yielding further benefits when pre-trained on the synthetic data. See https://shivammehta25.github.io/MAGI/ for example output.

Submitted 30 April, 2024; originally announced April 2024.

Comments: 13+1 pages, 2 figures, accepted at the Human Motion Generation workshop (HuMoGen) at CVPR 2024

MSC Class: 68T07 (Primary); 68T42 (Secondary) ACM Class: I.2.7; I.2.6; H.5

arXiv:2310.05181 [pdf, other]

Unified speech and gesture synthesis using flow matching

Authors: Shivam Mehta, Ruibo Tu, Simon Alexanderson, Jonas Beskow, Éva Székely, Gustav Eje Henter

Abstract: As text-to-speech technologies achieve remarkable naturalness in read-aloud tasks, there is growing interest in multimodal synthesis of verbal and non-verbal communicative behaviour, such as spontaneous speech and associated body gestures. This paper presents a novel, unified architecture for jointly synthesising speech acoustics and skeleton-based 3D gesture motion from text, trained using optimal-transport conditional flow matching (OT-CFM). The proposed architecture is simpler than the previous state of the art, has a smaller memory footprint, and can capture the joint distribution of speech and gestures, generating both modalities together in one single process. The new training regime, meanwhile, enables better synthesis quality in far fewer steps (network evaluations) than before. Uni- and multimodal subjective tests demonstrate improved speech naturalness, gesture human-likeness, and cross-modal appropriateness compared to existing benchmarks. Please see https://shivammehta25.github.io/Match-TTSG/ for video examples and code.

Submitted 9 January, 2024; v1 submitted 8 October, 2023; originally announced October 2023.

Comments: 5 pages, 1 figure. Final version, accepted to IEEE ICASSP 2024

MSC Class: 68T07 (Primary); 68T42 (Secondary) ACM Class: I.2.7; I.2.6; H.5

arXiv:2309.05455 [pdf, other]

doi 10.1145/3577190.3616117

Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation

Authors: Anna Deichler, Shivam Mehta, Simon Alexanderson, Jonas Beskow

Abstract: This paper describes a system developed for the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023. Our solution builds on an existing diffusion-based motion synthesis model. We propose a contrastive speech and motion pretraining (CSMP) module, which learns a joint embedding for speech and gesture with the aim of learning a semantic coupling between these modalities. The output of the CSMP module is used as a conditioning signal in the diffusion-based gesture synthesis model in order to achieve semantically-aware co-speech gesture generation. Our entry achieved the highest human-likeness and the highest speech-appropriateness ratings among the submitted entries. This indicates that our system is a promising approach to achieve human-like co-speech gestures in agents that carry semantic meaning.

Submitted 11 September, 2023; originally announced September 2023.

MSC Class: 68T42 ACM Class: I.2.6; I.2.7

arXiv:2309.03199 [pdf, other]

Matcha-TTS: A fast TTS architecture with conditional flow matching

Authors: Shivam Mehta, Ruibo Tu, Jonas Beskow, Éva Székely, Gustav Eje Henter

Abstract: We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM). This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching. Careful design choices additionally ensure each synthesis step is fast to run. The method is probabilistic, non-autoregressive, and learns to speak from scratch without external alignments. Compared to strong pre-trained baseline models, the Matcha-TTS system has the smallest memory footprint, rivals the speed of the fastest models on long utterances, and attains the highest mean opinion score in a listening test. Please see https://shivammehta25.github.io/Matcha-TTS/ for audio examples, code, and pre-trained models.

Submitted 9 January, 2024; v1 submitted 6 September, 2023; originally announced September 2023.

Comments: 5 pages, 3 figures. Final version, accepted to IEEE ICASSP 2024

MSC Class: 68T07 ACM Class: I.2.7; I.2.6; H.5.5

arXiv:2306.09417 [pdf, other]

doi 10.21437/SSW.2023-24

Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis

Authors: Shivam Mehta, Siyang Wang, Simon Alexanderson, Jonas Beskow, Éva Székely, Gustav Eje Henter

Abstract: With read-aloud speech synthesis achieving high naturalness scores, there is a growing research interest in synthesising spontaneous speech. However, human spontaneous face-to-face conversation has both spoken and non-verbal aspects (here, co-speech gestures). Only recently has research begun to explore the benefits of jointly synthesising these two modalities in a single system. The previous state of the art used non-probabilistic methods, which fail to capture the variability of human speech and motion, and risk producing oversmoothing artefacts and sub-optimal synthesis quality. We present the first diffusion-based probabilistic model, called Diff-TTSG, that jointly learns to synthesise speech and gestures together. Our method can be trained on small datasets from scratch. Furthermore, we describe a set of careful uni- and multi-modal subjective tests for evaluating integrated speech and gesture synthesis systems, and use them to validate our proposed approach. Please see https://shivammehta25.github.io/Diff-TTSG/ for video examples, data, and code.

Submitted 9 August, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

Comments: 7 pages, 2 figures, presented at the ISCA Speech Synthesis Workshop (SSW) 2023

MSC Class: 68T07 (Primary); 68T42 (Secondary) ACM Class: I.2.7; I.2.6; G.3; H.5.5

arXiv:2211.09707 [pdf, other]

doi 10.1145/3592458

Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models

Authors: Simon Alexanderson, Rajmund Nagy, Jonas Beskow, Gustav Eje Henter

Abstract: Diffusion models have experienced a surge of interest as highly expressive yet efficiently trainable probabilistic models. We show that these models are an excellent fit for synthesising human motion that co-occurs with audio, e.g., dancing and co-speech gesticulation, since motion is complex and highly ambiguous given audio, calling for a probabilistic description. Specifically, we adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved modelling power. We also demonstrate control over motion style, using classifier-free guidance to adjust the strength of the stylistic expression. Experiments on gesture and dance generation confirm that the proposed method achieves top-of-the-line motion quality, with distinctive styles whose expression can be made more or less pronounced. We also synthesise path-driven locomotion using the same model architecture. Finally, we generalise the guidance procedure to obtain product-of-expert ensembles of diffusion models and demonstrate how these may be used for, e.g., style interpolation, a contribution we believe is of independent interest. See https://www.speech.kth.se/research/listen-denoise-action/ for video examples, data, and code.

Submitted 16 May, 2023; v1 submitted 17 November, 2022; originally announced November 2022.

Comments: 20 pages, 9 figures. Published in ACM ToG and presented at SIGGRAPH 2023

MSC Class: 68T07 ACM Class: G.3; I.2.6; I.3.7; J.5

Journal ref: ACM Trans. Graph. 42, 4 (August 2023), 20 pages

arXiv:2211.06892 [pdf, other]

doi 10.21437/Interspeech.2023-1996

OverFlow: Putting flows on top of neural transducers for better TTS

Authors: Shivam Mehta, Ambika Kirkland, Harm Lameris, Jonas Beskow, Éva Székely, Gustav Eje Henter

Abstract: Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech. They combine the best features of classic statistical speech synthesis and modern neural TTS, requiring less data and fewer training updates, and are less prone to gibberish output caused by neural attention failures. In this paper, we combine neural HMM TTS with normalising flows for describing the highly non-Gaussian distribution of speech acoustics. The result is a powerful, fully probabilistic model of durations and acoustics that can be trained using exact maximum likelihood. Experiments show that a system based on our proposal needs fewer updates than comparable methods to produce accurate pronunciations and a subjective speech quality close to natural speech. Please see https://shivammehta25.github.io/OverFlow/ for audio examples and code.

Submitted 29 May, 2023; v1 submitted 13 November, 2022; originally announced November 2022.

Comments: 5 pages, 2 figures. Accepted for publication at Interspeech 2023

MSC Class: 68T07 ACM Class: I.2.7; I.2.6; G.3; H.5.5

arXiv:2109.01206 [pdf, other]

Mechanical Chameleons: Evaluating the effects of a social robot's non-verbal behavior on social influence

Authors: Patrik Jonell, Anna Deichler, Ilaria Torre, Iolanda Leite, Jonas Beskow

Abstract: In this paper we present a pilot study which investigates how non-verbal behavior affects social influence in social robots. We also present a modular system which is capable of controlling the non-verbal behavior based on the interlocutor's facial gestures (head movements and facial expressions) in real time, and a study investigating whether three different strategies for facial gestures ("still", "natural movement", i.e. movements recorded from another conversation, and "copy", i.e. mimicking the user with a four second delay) have any effect on social influence and decision making in a "survival task". Our preliminary results show there was no significant difference between the three conditions, but this might be due to, among other things, a low number of study participants (12).

Submitted 2 September, 2021; originally announced September 2021.

Comments: In proceedings of SCRITA 2021 (arXiv:2108.08092), a workshop at IEEE RO-MAN 2021: https://ro-man2021.org/

Report number: SCRITA/2021/07

arXiv:2108.13320 [pdf, other]

doi 10.1109/ICASSP43922.2022.9746686

Neural HMMs are all you need (for high-quality attention-free TTS)

Authors: Shivam Mehta, Éva Székely, Jonas Beskow, Gustav Eje Henter

Abstract: Neural sequence-to-sequence TTS has achieved significantly better output quality than statistical speech synthesis using HMMs. However, neural TTS is generally not probabilistic and uses non-monotonic attention. Attention failures increase training time and can make synthesis babble incoherently. This paper describes how the old and new paradigms can be combined to obtain the advantages of both worlds, by replacing attention in neural TTS with an autoregressive left-right no-skip hidden Markov model defined by a neural network. Based on this proposal, we modify Tacotron 2 to obtain an HMM-based neural TTS model with monotonic alignment, trained to maximise the full sequence likelihood without approximation. We also describe how to combine ideas from classical and contemporary TTS for best results. The resulting example system is smaller and simpler than Tacotron 2, and learns to speak with fewer iterations and less data, whilst achieving comparable naturalness prior to the post-net. Our approach also allows easy control over speaking rate.

Submitted 16 February, 2022; v1 submitted 30 August, 2021; originally announced August 2021.

Comments: 5 pages, 2 figures; final version for ICASSP 2022

MSC Class: 68T07 ACM Class: I.2.7; I.2.6; G.3; H.5.5

arXiv:2108.11436 [pdf, other]

doi 10.1145/3462244.3479914

Integrated Speech and Gesture Synthesis

Authors: Siyang Wang, Simon Alexanderson, Joakim Gustafson, Jonas Beskow, Gustav Eje Henter, Éva Székely

Abstract: Text-to-speech and co-speech gesture synthesis have until now been treated as separate areas by two different research communities, and applications merely stack the two technologies using a simple system-level pipeline. This can lead to modeling inefficiencies and may introduce inconsistencies that limit the achievable naturalness. We propose to instead synthesize the two modalities in a single model, a new problem we call integrated speech and gesture synthesis (ISG). We also propose a set of models modified from state-of-the-art neural speech-synthesis engines to achieve this goal. We evaluate the models in three carefully-designed user studies, two of which evaluate the synthesized speech and gesture in isolation, plus a combined study that evaluates the models like they will be used in real-world applications -- speech and gesture presented together. The results show that participants rate one of the proposed integrated synthesis models as being as good as the state-of-the-art pipeline system we compare against, in all three tests. The model is able to achieve this with faster synthesis time and greatly reduced parameter count compared to the pipeline system, illustrating some of the potential benefits of treating speech and gesture synthesis together as a single, unified problem. Videos and code are available on our project page at https://swatsw.github.io/isg_icmi21/

Submitted 25 August, 2021; originally announced August 2021.

Comments: 9 pages, accepted at ICMI 2021

arXiv:2106.13871 [pdf, other]

doi 10.1145/3478513.3480570

Transflower: probabilistic autoregressive dance generation with multimodal attention

Authors: Guillermo Valle-Pérez, Gustav Eje Henter, Jonas Beskow, André Holzapfel, Pierre-Yves Oudeyer, Simon Alexanderson

Abstract: Dance requires skillful composition of complex movements that follow rhythmic, tonal and timbral features of music. Formally, generating dance conditioned on a piece of music can be expressed as a problem of modelling a high-dimensional continuous motion signal, conditioned on an audio signal. In this work we make two contributions to tackle this problem. First, we present a novel probabilistic autoregressive architecture that models the distribution over future poses with a normalizing flow conditioned on previous poses as well as music context, using a multimodal transformer encoder. Second, we introduce the currently largest 3D dance-motion dataset, obtained with a variety of motion-capture technologies, and including both professional and casual dancers. Using this dataset, we compare our new model against two baselines, via objective metrics and a user study, and show that both the ability to model a probability distribution and the ability to attend over a large motion and music context are necessary to produce interesting, diverse, and realistic dance that matches the music.

Submitted 11 June, 2022; v1 submitted 25 June, 2021; originally announced June 2021.

Comments: Article presented at SIGGRAPH Asia 2021, and published in ACM Transactions on Graphics

arXiv:2101.05684 [pdf, other]

doi 10.1145/3383652.3423874

Generating coherent spontaneous speech and gesture from text

Authors: Simon Alexanderson, Éva Székely, Gustav Eje Henter, Taras Kucherenko, Jonas Beskow

Abstract: Embodied human communication encompasses both verbal (speech) and non-verbal information (e.g., gesture and head movements). Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both of these types of data: On the speech side, text-to-speech systems are now able to generate highly convincing, spontaneous-sounding speech using unscripted speech audio as the source material. On the motion side, probabilistic motion-generation methods can now synthesise vivid and lifelike speech-driven 3D gesticulation. In this paper, we put these two state-of-the-art technologies together in a coherent fashion for the first time. Concretely, we demonstrate a proof-of-concept system trained on a single-speaker audio and motion-capture dataset, that is able to generate both speech and full-body gestures together from text input. In contrast to previous approaches for joint speech-and-gesture generation, we generate full-body gestures from speech synthesis trained on recordings of spontaneous speech from the same person as the motion-capture data. We illustrate our results by visualising gesture spaces and text-speech-gesture alignments, and through a demonstration video at https://simonalexanderson.github.io/IVA2020 .

Submitted 14 January, 2021; originally announced January 2021.

Comments: 3 pages, 2 figures, published at the ACM International Conference on Intelligent Virtual Agents (IVA) 2020

MSC Class: 68T07 ACM Class: I.2.6; J.4; I.3.7; I.2.9

Journal ref: Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents (IVA '20), 2020, 3 pages

arXiv:2009.10760 [pdf, other]

doi 10.1145/3383652.3423860

Can we trust online crowdworkers? Comparing online and offline participants in a preference test of virtual agents

Authors: Patrik Jonell, Taras Kucherenko, Ilaria Torre, Jonas Beskow

Abstract: Conducting user studies is a crucial component in many scientific fields. While some studies require participants to be physically present, other studies can be conducted both physically (e.g. in-lab) and online (e.g. via crowdsourcing). Inviting participants to the lab can be a time-consuming and logistically difficult endeavor, not to mention that sometimes research groups might not be able to run in-lab experiments, because of, for example, a pandemic. Crowdsourcing platforms such as Amazon Mechanical Turk (AMT) or Prolific can therefore be a suitable alternative to run certain experiments, such as evaluating virtual agents. Although previous studies investigated the use of crowdsourcing platforms for running experiments, there is still uncertainty as to whether the results are reliable for perceptual studies. Here we replicate a previous experiment where participants evaluated a gesture generation model for virtual agents. The experiment is conducted across three participant pools -- in-lab, Prolific, and AMT -- having similar demographics across the in-lab participants and the Prolific platform. Our results show no difference between the three participant pools in regards to their evaluations of the gesture generation models and their reliability scores. The results indicate that online platforms can successfully be used for perceptual evaluations of this kind.

Submitted 23 October, 2020; v1 submitted 22 September, 2020; originally announced September 2020.

Comments: Patrik Jonell and Taras Kucherenko contributed equally to this work. Published in the Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents. 8 pages, 7 figures

arXiv:2006.09888 [pdf, other]

doi 10.1145/3383652.3423911

Let's Face It: Probabilistic Multi-modal Interlocutor-aware Generation of Facial Gestures in Dyadic Settings

Authors: Patrik Jonell, Taras Kucherenko, Gustav Eje Henter, Jonas Beskow

Abstract: To enable more natural face-to-face interactions, conversational agents need to adapt their behavior to their interlocutors. One key aspect of this is generation of appropriate non-verbal behavior for the agent, for example facial gestures, here defined as facial expressions and head movements. Most existing gesture-generating systems do not utilize multi-modal cues from the interlocutor when synthesizing non-verbal behavior. Those that do, typically use deterministic methods that risk producing repetitive and non-vivid motions. In this paper, we introduce a probabilistic method to synthesize interlocutor-aware facial gestures - represented by highly expressive FLAME parameters - in dyadic conversations. Our contributions are: a) a method for feature extraction from multi-party video and speech recordings, resulting in a representation that allows for independent control and manipulation of expression and speech articulation in a 3D avatar; b) an extension to MoGlow, a recent motion-synthesis method based on normalizing flows, to also take multi-modal signals from the interlocutor as input and subsequently output interlocutor-aware facial gestures; and c) a subjective evaluation assessing the use and relative importance of the input modalities. The results show that the model successfully leverages the input from the interlocutor to generate more appropriate behavior. Videos, data, and code available at: https://jonepatr.github.io/lets_face_it.

Submitted 22 October, 2020; v1 submitted 11 June, 2020; originally announced June 2020.

Comments: Best Paper Award. 8 pages, 4 figures, IVA '20: Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents

arXiv:1905.06598 [pdf, other]

doi 10.1145/3414685.3417836

MoGlow: Probabilistic and controllable motion synthesis using normalising flows

Authors: Gustav Eje Henter, Simon Alexanderson, Jonas Beskow

Abstract: Data-driven modelling and synthesis of motion is an active research area with applications that include animation, games, and social robotics. This paper introduces a new class of probabilistic, generative, and controllable motion-data models based on normalising flows. Models of this kind can describe highly complex distributions, yet can be trained efficiently using exact maximum likelihood, unlike GANs or VAEs. Our proposed model is autoregressive and uses LSTMs to enable arbitrarily long time-dependencies. Importantly, it is also causal, meaning that each pose in the output sequence is generated without access to poses or control inputs from future time steps; this absence of algorithmic latency is important for interactive applications with real-time motion control. The approach can in principle be applied to any type of motion since it does not make restrictive, task-specific assumptions regarding the motion or the character morphology. We evaluate the models on motion-capture datasets of human and quadruped locomotion. Objective and subjective results show that randomly-sampled motion from the proposed method outperforms task-agnostic baselines and attains a motion quality close to recorded motion capture.

Submitted 7 December, 2020; v1 submitted 16 May, 2019; originally announced May 2019.

Comments: 14 pages, 5 figures, published in ACM Transactions on Graphics and presented at SIGGRAPH Asia 2020

ACM Class: I.3.7; G.3; I.2.6

Journal ref: ACM Trans. Graph. 39, 4, Article 236 (November 2020), 14 pages

arXiv:1901.10461 [pdf, other]

The effect of a physical robot on vocabulary learning

Authors: Andreas Wedenborn, Preben Wik, Olov Engwall, Jonas Beskow

Abstract: This study investigates the effect of a physical robot taking the role of a teacher or exercise partner in a language learning exercise. To investigate this, an application was developed enabling a second-language vocabulary exercise in three different conditions. In the first condition the learner received tutoring from a disembodied voice, in the second condition the tutor was embodied by an animated avatar on a computer screen, and in the final condition the tutor was a physical robotic head with a 3D animated face mask. A Russian language vocabulary exercise with 15 subjects was conducted. None of the subjects reported any Russian language skills prior to the exercises. Each subject was taught a set of 9 words in each of the three conditions during a practice phase, and was then asked to recall the words in a test phase. Results show that recall of the words practiced with the physical robot was significantly higher than that of the words practiced with the avatar on the screen or with the disembodied voice.

Submitted 30 January, 2019; originally announced January 2019.

Comments: This study was presented at the International Workshop on Spoken Dialogue Systems (IWSDS 2016), Saariselkä, Finland, January 2016

arXiv:1803.02665 [pdf, other]

A Neural Network Approach to Missing Marker Reconstruction in Human Motion Capture

Authors: Taras Kucherenko, Jonas Beskow, Hedvig Kjellström

Abstract: Optical motion capture systems have become a widely used technology in various fields, such as augmented reality, robotics, movie production, etc. Such systems use a large number of cameras to triangulate the position of optical markers. The marker positions are estimated with high accuracy. However, especially when tracking articulated bodies, a fraction of the markers in each timestep is missing from the reconstruction. In this paper, we propose to use a neural network approach to learn how human motion is temporally and spatially correlated, and reconstruct missing marker positions through this model. We experiment with two different models, one LSTM-based and one time-window-based. Both methods produce state-of-the-art results, while working online, as opposed to most of the alternative methods, which require the complete sequence to be known. The implementation is publicly available at https://github.com/Svito-zar/NN-for-Missing-Marker-Reconstruction .

Submitted 25 September, 2018; v1 submitted 7 March, 2018; originally announced March 2018.

Comments: 7 pages, 6 figures

MSC Class: 68T05

arXiv:1711.08992 [pdf, other]

doi 10.1109/TCDS.2019.2927941

Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition

Authors: Kalin Stefanov, Jonas Beskow, Giampiero Salvi

Abstract: This paper presents a self-supervised method for visual detection of the active speaker in a multi-person spoken interaction scenario. Active speaker detection is a fundamental prerequisite for any artificial cognitive system attempting to acquire language in social settings. The proposed method is intended to complement the acoustic detection of the active speaker, thus improving the system robustness in noisy conditions. The method can detect an arbitrary number of possibly overlapping active speakers based exclusively on visual information about their faces. Furthermore, the method does not rely on external annotations, thus complying with cognitive development. Instead, the method uses information from the auditory modality to support learning in the visual domain. This paper reports an extensive evaluation of the proposed method using a large multi-person face-to-face interaction dataset. The results show good performance in a speaker-dependent setting. However, in a speaker-independent setting the proposed method yields a significantly lower performance. We believe that the proposed method represents an essential component of any artificial cognitive system or robotic platform engaging in social interactions.

Submitted 18 July, 2019; v1 submitted 24 November, 2017; originally announced November 2017.

Comments: 10 pages, IEEE Transactions on Cognitive and Developmental Systems

ACM Class: I.2; I.4; I.5

arXiv:1709.01613 [pdf, other]

Machine Learning and Social Robotics for Detecting Early Signs of Dementia

Authors: Patrik Jonell, Joseph Mendelson, Thomas Storskog, Goran Hagman, Per Ostberg, Iolanda Leite, Taras Kucherenko, Olga Mikheeva, Ulrika Akenine, Vesna Jelic, Alina Solomon, Jonas Beskow, Joakim Gustafson, Miia Kivipelto, Hedvig Kjellstrom

Abstract: This paper presents the EACare project, an ambitious multi-disciplinary collaboration with the aim to develop an embodied system, capable of carrying out neuropsychological tests to detect early signs of dementia, e.g., due to Alzheimer's disease. The system will use methods from Machine Learning and Social Robotics, and be trained with examples of recorded clinician-patient interactions. The interaction will be developed using a participatory design approach. We describe the scope and method of the project, and report on a first Wizard of Oz prototype.

Submitted 5 September, 2017; originally announced September 2017.

arXiv:1211.3901 [pdf, other]

Visual Recognition of Isolated Swedish Sign Language Signs

Authors: Saad Akram, Jonas Beskow, Hedvig Kjellstrom

Abstract: We present a method for recognition of isolated Swedish Sign Language signs. The method will be used in a game intended to help children practise signing at home, as a complement to training with a teacher. The target group is not primarily deaf children, but children with language disorders. Using sign language as a support in conversation has been shown to greatly stimulate the speech development of such children. The signer is captured with an RGB-D (Kinect) sensor, which has three advantages over a regular RGB camera. Firstly, it allows complex backgrounds to be removed easily. We segment the hands and face based on skin color and depth information. Secondly, it helps with the resolution of hand-over-face occlusion. Thirdly, signs take place in 3D; some aspects of the signs are defined by hand motion perpendicular to the image plane. This motion can be estimated if the depth is observable. The 3D motion of the hands relative to the torso is used as a cue together with the hand shape, and HMMs trained with this input are used for classification. To obtain higher robustness towards differences across signers, Fisher Linear Discriminant Analysis is used to find the combinations of features that are most descriptive for each sign, regardless of signer. Experiments show that the system can distinguish signs from a challenging 94-word vocabulary with a precision of up to 94% in the signer-dependent case and up to 47% in the signer-independent case.

Submitted 16 November, 2012; originally announced November 2012.

Showing 1–21 of 21 results for author: Beskow, J