-
Realization of Anosov Diffeomorphisms on the Torus
Authors:
Tamara Kucherenko,
Anthony Quas
Abstract:
We study area preserving Anosov maps on the two-dimensional torus within a fixed homotopy class. We show that the set of pressure functions for Anosov diffeomorphisms with respect to the geometric potential is equal to the set of pressure functions for the linear Anosov automorphism with respect to Hölder potentials. We use this result to provide a negative answer to the $C^{1+α}$ version of the question posed by Rodriguez Hertz on whether two homotopic area preserving $C^\infty$ Anosov diffeomorphisms whose geometric potentials have identical pressure functions must be $C^\infty$ conjugate.
Submitted 3 July, 2024;
originally announced July 2024.
-
Ergodic theory on coded shift spaces
Authors:
Tamara Kucherenko,
Martin Schmoll,
Christian Wolf
Abstract:
We study ergodic-theoretic properties of coded shift spaces. A coded shift space is defined as the closure of all bi-infinite concatenations of words from a fixed countable generating set. We derive sufficient conditions for the uniqueness of measures of maximal entropy and equilibrium states of Hölder continuous potentials based on the partition of the coded shift into its concatenation set (sequences that are concatenations of generating words) and its residual set (sequences added under the closure). In this case we provide a simple explicit description of the measure of maximal entropy. We also obtain flexibility results for the entropy on the concatenation and residual sets. Finally, we prove a local structure theorem for intrinsically ergodic coded shift spaces which shows that our results apply to a larger class of coded shift spaces compared to previous works by Climenhaga, Climenhaga and Thompson, and Pavlov.
Submitted 9 July, 2024; v1 submitted 28 October, 2023;
originally announced October 2023.
-
The GENEA Challenge 2023: A large scale evaluation of gesture generation models in monadic and dyadic settings
Authors:
Taras Kucherenko,
Rajmund Nagy,
Youngwoo Yoon,
Jieyeon Woo,
Teodor Nikolov,
Mihail Tsakov,
Gustav Eje Henter
Abstract:
This paper reports on the GENEA Challenge 2023, in which participating teams built speech-driven gesture-generation systems using the same speech and motion dataset, followed by a joint evaluation. This year's challenge provided data on both sides of a dyadic interaction, allowing teams to generate full-body motion for an agent given its speech (text and audio) and the speech and motion of the interlocutor. We evaluated 12 submissions and 2 baselines together with held-out motion-capture data in several large-scale user studies. The studies focused on three aspects: 1) the human-likeness of the motion, 2) the appropriateness of the motion for the agent's own speech whilst controlling for the human-likeness of the motion, and 3) the appropriateness of the motion for the behaviour of the interlocutor in the interaction, using a setup that controls for both the human-likeness of the motion and the agent's own speech. We found a large span in human-likeness between challenge submissions, with a few systems rated close to human mocap. Appropriateness seems far from being solved, with most submissions performing in a narrow range slightly above chance, far behind natural motion. The effect of the interlocutor is even more subtle, with submitted systems at best performing barely above chance. Interestingly, a dyadic system being highly appropriate for agent speech does not necessarily imply high appropriateness for the interlocutor. Additional material is available via the project website at https://svito-zar.github.io/GENEAchallenge2023/ .
Submitted 24 August, 2023;
originally announced August 2023.
-
Evaluating gesture generation in a large-scale open challenge: The GENEA Challenge 2022
Authors:
Taras Kucherenko,
Pieter Wolfert,
Youngwoo Yoon,
Carla Viegas,
Teodor Nikolov,
Mihail Tsakov,
Gustav Eje Henter
Abstract:
This paper reports on the second GENEA Challenge to benchmark data-driven automatic co-speech gesture generation. Participating teams used the same speech and motion dataset to build gesture-generation systems. Motion generated by all these systems was rendered to video using a standardised visualisation pipeline and evaluated in several large, crowdsourced user studies. Unlike when comparing different research papers, differences in results are here only due to differences between methods, enabling direct comparison between systems. The dataset was based on 18 hours of full-body motion capture, including fingers, of different persons engaging in a dyadic conversation. Ten teams participated in the challenge across two tiers: full-body and upper-body gesticulation. For each tier, we evaluated both the human-likeness of the gesture motion and its appropriateness for the specific speech signal. Our evaluations decouple human-likeness from gesture appropriateness, which has been a difficult problem in the field.
The evaluation results show some synthetic gesture conditions being rated as significantly more human-like than 3D human motion capture. To the best of our knowledge, this has not been demonstrated before. On the other hand, all synthetic motion is found to be vastly less appropriate for the speech than the original motion-capture recordings. We also find that conventional objective metrics do not correlate well with subjective human-likeness ratings in this large evaluation. The one exception is the Fréchet gesture distance (FGD), which achieves a Kendall's tau rank correlation of around $-0.5$. Based on the challenge results we formulate numerous recommendations for system building and evaluation.
Submitted 28 March, 2024; v1 submitted 15 March, 2023;
originally announced March 2023.
-
Asymptotic behavior of the pressure function for Hölder potentials
Authors:
Tamara Kucherenko,
Anthony Quas
Abstract:
We study the behavior of the pressure function for Hölder continuous potentials on mixing subshifts of finite type. The classical theory of thermodynamic formalism shows that such pressure functions are convex, analytic and have slant asymptotes. We provide a sharp exponential lower bound on how fast the pressure function approaches its asymptotes. As a counterpart, we also show that there is no corresponding upper bound by exhibiting systems for which the convergence is arbitrarily slow. However, we prove that the exponential upper bound still holds for a generic Hölder potential. In addition, we determine that the pressure function satisfies a coarse uniform convexity property. Asymptotic bounds and quantitative convexity estimates are the first additional general properties of the pressure function obtained in the settings of Bowen and Ruelle since their groundbreaking work more than 40 years ago.
Submitted 28 February, 2023;
originally announced February 2023.
-
A Comprehensive Review of Data-Driven Co-Speech Gesture Generation
Authors:
Simbarashe Nyatsanga,
Taras Kucherenko,
Chaitanya Ahuja,
Gustav Eje Henter,
Michael Neff
Abstract:
Gestures that accompany speech are an essential part of natural and efficient embodied human communication. The automatic generation of such co-speech gestures is a long-standing problem in computer animation and is considered an enabling technology in film, games, virtual social spaces, and for interaction with social robots. The problem is made challenging by the idiosyncratic and non-periodic nature of human co-speech gesture motion, and by the great diversity of communicative functions that gestures encompass. Gesture generation has seen surging interest recently, owing to the emergence of more and larger datasets of human gesture motion, combined with strides in deep-learning-based generative models, that benefit from the growing availability of data. This review article summarizes co-speech gesture generation research, with a particular focus on deep generative models. First, we articulate the theory describing human gesticulation and how it complements speech. Next, we briefly discuss rule-based and classical statistical gesture synthesis, before delving into deep learning approaches. We employ the choice of input modalities as an organizing principle, examining systems that generate gestures from audio, text, and non-linguistic input. We also chronicle the evolution of the related training data sets in terms of size, diversity, motion quality, and collection method. Finally, we identify key research challenges in gesture generation, including data availability and quality; producing human-like motion; grounding the gesture in the co-occurring speech in interaction with other speakers, and in the environment; performing gesture evaluation; and integration of gesture synthesis into applications. We highlight recent approaches to tackling the various key challenges, as well as the limitations of these approaches, and point toward areas of future development.
Submitted 10 April, 2023; v1 submitted 12 January, 2023;
originally announced January 2023.
-
Evaluating Data-Driven Co-Speech Gestures of Embodied Conversational Agents through Real-Time Interaction
Authors:
Yuan He,
André Pereira,
Taras Kucherenko
Abstract:
Embodied Conversational Agents that make use of co-speech gestures can enhance human-machine interactions in many ways. In recent years, data-driven gesture generation approaches for ECAs have attracted considerable research attention, and related methods have continuously improved. Real-time interaction is typically used when researchers evaluate ECA systems that generate rule-based gestures. However, when evaluating the performance of ECAs based on data-driven methods, participants are often required only to watch pre-recorded videos, which cannot provide adequate information about what a person perceives during the interaction. To address this limitation, we explored the use of real-time interaction to assess data-driven gesturing ECAs. We provided a testbed framework, and investigated whether gestures could affect human perception of ECAs in the dimensions of human-likeness, animacy, perceived intelligence, and focused attention. Our user study required participants to interact with two ECAs - one with and one without hand gestures. We collected subjective data from the participants' self-report questionnaires and objective data from a gaze tracker. To our knowledge, the current study represents the first attempt to evaluate data-driven gesturing ECAs through real-time interaction and the first experiment using gaze-tracking to examine the effect of ECAs' gestures.
Submitted 13 October, 2022;
originally announced October 2022.
-
The GENEA Challenge 2022: A large evaluation of data-driven co-speech gesture generation
Authors:
Youngwoo Yoon,
Pieter Wolfert,
Taras Kucherenko,
Carla Viegas,
Teodor Nikolov,
Mihail Tsakov,
Gustav Eje Henter
Abstract:
This paper reports on the second GENEA Challenge to benchmark data-driven automatic co-speech gesture generation. Participating teams used the same speech and motion dataset to build gesture-generation systems. Motion generated by all these systems was rendered to video using a standardised visualisation pipeline and evaluated in several large, crowdsourced user studies. Unlike when comparing different research papers, differences in results are here only due to differences between methods, enabling direct comparison between systems. This year's dataset was based on 18 hours of full-body motion capture, including fingers, of different persons engaging in dyadic conversation. Ten teams participated in the challenge across two tiers: full-body and upper-body gesticulation. For each tier we evaluated both the human-likeness of the gesture motion and its appropriateness for the specific speech signal. Our evaluations decouple human-likeness from gesture appropriateness, which previously was a major challenge in the field.
The evaluation results are a revolution, and a revelation. Some synthetic conditions are rated as significantly more human-like than human motion capture. To the best of our knowledge, this has never been shown before on a high-fidelity avatar. On the other hand, all synthetic motion is found to be vastly less appropriate for the speech than the original motion-capture recordings. Additional material is available via the project website at https://youngwoo-yoon.github.io/GENEAchallenge2022/
Submitted 22 August, 2022;
originally announced August 2022.
-
Multimodal analysis of the predictability of hand-gesture properties
Authors:
Taras Kucherenko,
Rajmund Nagy,
Michael Neff,
Hedvig Kjellström,
Gustav Eje Henter
Abstract:
Embodied conversational agents benefit from being able to accompany their speech with gestures. Although many data-driven approaches to gesture generation have been proposed in recent years, it is still unclear whether such systems can consistently generate gestures that convey meaning. We investigate which gesture properties (phase, category, and semantics) can be predicted from speech text and/or audio using contemporary deep learning. In extensive experiments, we show that gesture properties related to gesture meaning (semantics and category) are predictable from text features (time-aligned FastText embeddings) alone, but not from prosodic audio features, while rhythm-related gesture properties (phase) on the other hand can be predicted from audio features better than from text. These results are encouraging as they indicate that it is possible to equip an embodied agent with content-wise meaningful co-speech gestures using a machine-learning model.
Submitted 14 January, 2022; v1 submitted 12 August, 2021;
originally announced August 2021.
-
To Rate or Not To Rate: Investigating Evaluation Methods for Generated Co-Speech Gestures
Authors:
Pieter Wolfert,
Jeffrey M. Girard,
Taras Kucherenko,
Tony Belpaeme
Abstract:
While automatic performance metrics are crucial for machine learning of artificial human-like behaviour, the gold standard for evaluation remains human judgement. The subjective evaluation of artificial human-like behaviour in embodied conversational agents is, however, expensive, and little is known about the quality of the data it returns. Two approaches to subjective evaluation can be largely distinguished, one relying on ratings, the other on pairwise comparisons. In this study we use co-speech gestures to compare the two against each other and answer questions about their appropriateness for evaluation of artificial behaviour. We consider their ability to rate quality, but also aspects pertaining to the effort of use and the time required to collect subjective data. We use crowdsourcing to rate the quality of co-speech gestures in avatars, assessing which method picks up more detail in subjective assessments. We compared gestures generated by three different machine learning models with various levels of behavioural quality. We found that both approaches were able to rank the videos according to quality and that the rankings were significantly correlated, showing that in terms of quality there is no preference of one method over the other. We also found that pairwise comparisons were slightly faster and came with improved inter-rater reliability, suggesting that for small-scale studies pairwise comparisons are to be favoured over ratings.
Submitted 13 August, 2021; v1 submitted 12 August, 2021;
originally announced August 2021.
-
Flexibility of the Pressure Function
Authors:
Tamara Kucherenko,
Anthony Quas
Abstract:
We study the flexibility of the pressure function of a continuous potential (observable) with respect to a parameter regarded as the inverse temperature. The points of non-differentiability of this function are of particular interest in statistical physics, since they correspond to phase transitions. It is well known that the pressure function is convex, Lipschitz, and has an asymptote at infinity. We prove that in a setting of one-dimensional compact symbolic systems these are the only restrictions. We present a method to explicitly construct a continuous potential whose pressure function coincides with \emph{any} prescribed convex Lipschitz asymptotically linear function starting at a given positive value of the parameter. In fact, we establish a multidimensional version of this result. As a consequence, we obtain that for a continuous observable the phase transitions can occur at a countable dense set of temperature values. We go further and show that one can vary the cardinality of the set of ergodic equilibrium states as a function of the parameter to be any number, finite or infinite.
Submitted 28 February, 2023; v1 submitted 1 August, 2021;
originally announced August 2021.
-
Speech2Properties2Gestures: Gesture-Property Prediction as a Tool for Generating Representational Gestures from Speech
Authors:
Taras Kucherenko,
Rajmund Nagy,
Patrik Jonell,
Michael Neff,
Hedvig Kjellström,
Gustav Eje Henter
Abstract:
We propose a new framework for gesture generation, aiming to allow data-driven approaches to produce more semantically rich gestures. Our approach first predicts whether to gesture, followed by a prediction of the gesture properties. Those properties are then used as conditioning for a modern probabilistic gesture-generation model capable of high-quality output. This empowers the approach to generate gestures that are both diverse and representational. Follow-ups and more information can be found on the project page: https://svito-zar.github.io/speech2properties2gestures/ .
Submitted 13 August, 2021; v1 submitted 28 June, 2021;
originally announced June 2021.
-
A Framework for Integrating Gesture Generation Models into Interactive Conversational Agents
Authors:
Rajmund Nagy,
Taras Kucherenko,
Birger Moell,
André Pereira,
Hedvig Kjellström,
Ulysses Bernardet
Abstract:
Embodied conversational agents (ECAs) benefit from non-verbal behavior for natural and efficient interaction with users. Gesticulation - hand and arm movements accompanying speech - is an essential part of non-verbal behavior. Gesture generation models have been developed for several decades: starting with rule-based and ending with mainly data-driven methods. To date, recent end-to-end gesture generation methods have not been evaluated in a real-time interaction with users. We present a proof-of-concept framework, which is intended to facilitate evaluation of modern gesture generation models in interaction.
We demonstrate an extensible open-source framework that contains three components: 1) a 3D interactive agent; 2) a chatbot backend; 3) a gesticulating system. Each component can be replaced, making the proposed framework applicable for investigating the effect of different gesturing models in real-time interactions with different communication modalities, chatbot backends, or different agent appearances. The code and video are available at the project page https://nagyrajmund.github.io/project/gesturebot.
Submitted 24 February, 2021;
originally announced February 2021.
-
A large, crowdsourced evaluation of gesture generation systems on common data: The GENEA Challenge 2020
Authors:
Taras Kucherenko,
Patrik Jonell,
Youngwoo Yoon,
Pieter Wolfert,
Gustav Eje Henter
Abstract:
Co-speech gestures, gestures that accompany speech, play an important role in human communication. Automatic co-speech gesture generation is thus a key enabling technology for embodied conversational agents (ECAs), since humans expect ECAs to be capable of multi-modal communication. Research into gesture generation is rapidly gravitating towards data-driven methods. Unfortunately, individual research efforts in the field are difficult to compare: there are no established benchmarks, and each study tends to use its own dataset, motion visualisation, and evaluation methodology. To address this situation, we launched the GENEA Challenge, a gesture-generation challenge wherein participating teams built automatic gesture-generation systems on a common dataset, and the resulting systems were evaluated in parallel in a large, crowdsourced user study using the same motion-rendering pipeline. Since differences in evaluation outcomes between systems now are solely attributable to differences between the motion-generation methods, this enables benchmarking recent approaches against one another in order to get a better impression of the state of the art in the field. This paper reports on the purpose, design, results, and implications of our challenge.
Submitted 23 February, 2021;
originally announced February 2021.
-
HEMVIP: Human Evaluation of Multiple Videos in Parallel
Authors:
Patrik Jonell,
Youngwoo Yoon,
Pieter Wolfert,
Taras Kucherenko,
Gustav Eje Henter
Abstract:
In many research areas, for example motion and gesture generation, objective measures alone do not provide an accurate impression of key stimulus traits such as perceived quality or appropriateness. The gold standard is instead to evaluate these aspects through user studies, especially subjective evaluations of video stimuli. Common evaluation paradigms either present individual stimuli to be scored on Likert-type scales, or ask users to compare and rate videos in a pairwise fashion. However, the time and resources required for such evaluations scale poorly as the number of conditions to be compared increases. Building on standards used for evaluating the quality of multimedia codecs, this paper instead introduces a framework for granular rating of multiple comparable videos in parallel. This methodology essentially analyses all condition pairs at once. Our contributions are 1) a proposed framework, called HEMVIP, for parallel and granular evaluation of multiple video stimuli and 2) a validation study confirming that results obtained using the tool are in close agreement with results of prior studies using conventional multiple pairwise comparisons.
Submitted 20 October, 2021; v1 submitted 28 January, 2021;
originally announced January 2021.
-
Generating coherent spontaneous speech and gesture from text
Authors:
Simon Alexanderson,
Éva Székely,
Gustav Eje Henter,
Taras Kucherenko,
Jonas Beskow
Abstract:
Embodied human communication encompasses both verbal (speech) and non-verbal information (e.g., gesture and head movements). Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both of these types of data: On the speech side, text-to-speech systems are now able to generate highly convincing, spontaneous-sounding speech using unscripted speech audio as the source material. On the motion side, probabilistic motion-generation methods can now synthesise vivid and lifelike speech-driven 3D gesticulation. In this paper, we put these two state-of-the-art technologies together in a coherent fashion for the first time. Concretely, we demonstrate a proof-of-concept system trained on a single-speaker audio and motion-capture dataset, that is able to generate both speech and full-body gestures together from text input. In contrast to previous approaches for joint speech-and-gesture generation, we generate full-body gestures from speech synthesis trained on recordings of spontaneous speech from the same person as the motion-capture data. We illustrate our results by visualising gesture spaces and text-speech-gesture alignments, and through a demonstration video at https://simonalexanderson.github.io/IVA2020 .
Submitted 14 January, 2021;
originally announced January 2021.
-
Can we trust online crowdworkers? Comparing online and offline participants in a preference test of virtual agents
Authors:
Patrik Jonell,
Taras Kucherenko,
Ilaria Torre,
Jonas Beskow
Abstract:
Conducting user studies is a crucial component in many scientific fields. While some studies require participants to be physically present, other studies can be conducted both physically (e.g. in-lab) and online (e.g. via crowdsourcing). Inviting participants to the lab can be a time-consuming and logistically difficult endeavor, not to mention that sometimes research groups might not be able to run in-lab experiments, because of, for example, a pandemic. Crowdsourcing platforms such as Amazon Mechanical Turk (AMT) or Prolific can therefore be a suitable alternative to run certain experiments, such as evaluating virtual agents. Although previous studies investigated the use of crowdsourcing platforms for running experiments, there is still uncertainty as to whether the results are reliable for perceptual studies. Here we replicate a previous experiment where participants evaluated a gesture generation model for virtual agents. The experiment is conducted across three participant pools -- in-lab, Prolific, and AMT -- having similar demographics across the in-lab participants and the Prolific platform. Our results show no difference between the three participant pools with regard to their evaluations of the gesture generation models and their reliability scores. The results indicate that online platforms can successfully be used for perceptual evaluations of this kind.
Submitted 23 October, 2020; v1 submitted 22 September, 2020;
originally announced September 2020.
-
Moving fast and slow: Analysis of representations and post-processing in speech-driven automatic gesture generation
Authors:
Taras Kucherenko,
Dai Hasegawa,
Naoshi Kaneko,
Gustav Eje Henter,
Hedvig Kjellström
Abstract:
This paper presents a novel framework for speech-driven gesture production, applicable to virtual agents to enhance human-computer interaction. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates. We provide an analysis of different representations for the input (speech) and the output (motion) of the network by both objective and subjective evaluations. We also analyse the importance of smoothing of the produced motion. Our results indicated that the proposed method improved on our baseline in terms of objective measures. For example, it better captured the motion dynamics and better matched the motion-speed distribution. Moreover, we performed user studies on two different datasets. The studies confirmed that our proposed method is perceived as more natural than the baseline, although the difference in the studies was eliminated by appropriate post-processing: hip-centering and smoothing. We conclude that it is important to take both motion representation and post-processing into account when designing an automatic gesture-production method.
Submitted 28 January, 2021; v1 submitted 16 July, 2020;
originally announced July 2020.
-
Multiple phase transitions on compact symbolic systems
Authors:
Tamara Kucherenko,
Anthony Quas,
Christian Wolf
Abstract:
Let $φ:X\to \mathbb R$ be a continuous potential associated with a symbolic dynamical system $T:X\to X$ over a finite alphabet. Introducing a parameter $β>0$ (interpreted as the inverse temperature) we study the regularity of the pressure function $β\mapsto P_{\rm top}(βφ)$ on an interval $[α,\infty)$ with $α>0$. We say that $φ$ has a phase transition at $β_0$ if the pressure function $P_{\rm top}(βφ)$ is not differentiable at $β_0$. This is equivalent to the condition that the potential $β_0φ$ has two (ergodic) equilibrium states with distinct entropies. For any $α>0$ and any increasing sequence of real numbers $(β_n)$ contained in $[α,\infty)$, we construct a potential $φ$ whose phase transitions in $[α,\infty)$ occur precisely at the $β_n$'s. In particular, we obtain a potential which has a countably infinite set of phase transitions.
Submitted 5 September, 2020; v1 submitted 24 June, 2020;
originally announced June 2020.
-
Let's Face It: Probabilistic Multi-modal Interlocutor-aware Generation of Facial Gestures in Dyadic Settings
Authors:
Patrik Jonell,
Taras Kucherenko,
Gustav Eje Henter,
Jonas Beskow
Abstract:
To enable more natural face-to-face interactions, conversational agents need to adapt their behavior to their interlocutors. One key aspect of this is generation of appropriate non-verbal behavior for the agent, for example facial gestures, here defined as facial expressions and head movements. Most existing gesture-generating systems do not utilize multi-modal cues from the interlocutor when synthesizing non-verbal behavior. Those that do, typically use deterministic methods that risk producing repetitive and non-vivid motions. In this paper, we introduce a probabilistic method to synthesize interlocutor-aware facial gestures - represented by highly expressive FLAME parameters - in dyadic conversations. Our contributions are: a) a method for feature extraction from multi-party video and speech recordings, resulting in a representation that allows for independent control and manipulation of expression and speech articulation in a 3D avatar; b) an extension to MoGlow, a recent motion-synthesis method based on normalizing flows, to also take multi-modal signals from the interlocutor as input and subsequently output interlocutor-aware facial gestures; and c) a subjective evaluation assessing the use and relative importance of the input modalities. The results show that the model successfully leverages the input from the interlocutor to generate more appropriate behavior. Videos, data, and code available at: https://jonepatr.github.io/lets_face_it.
Submitted 22 October, 2020; v1 submitted 11 June, 2020;
originally announced June 2020.
-
Gesticulator: A framework for semantically-aware speech-driven gesture generation
Authors:
Taras Kucherenko,
Patrik Jonell,
Sanne van Waveren,
Gustav Eje Henter,
Simon Alexanderson,
Iolanda Leite,
Hedvig Kjellström
Abstract:
During speech, people spontaneously gesticulate, which plays a key role in conveying information. Similarly, realistic co-speech gestures are crucial to enable natural and smooth interactions with social agents. Current end-to-end co-speech gesture generation systems use a single modality for representing speech: either audio or text. These systems are therefore confined to producing either acoustically-linked beat gestures or semantically-linked gesticulation (e.g., raising a hand when saying "high"): they cannot appropriately learn to generate both gesture types. We present a model designed to produce arbitrary beat and semantic gestures together. Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output. The resulting gestures can be applied to both virtual agents and humanoid robots. Subjective and objective evaluations confirm the success of our approach. The code and video are available at the project page https://svito-zar.github.io/gesticulator .
Submitted 14 January, 2021; v1 submitted 25 January, 2020;
originally announced January 2020.
-
Measures of maximal entropy on subsystems of topological suspension semi-flows
Authors:
Tamara Kucherenko,
Daniel J. Thompson
Abstract:
Given a compact topological dynamical system (X, f) with positive entropy and upper semi-continuous entropy map, and any closed invariant subset $Y \subset X$ with positive entropy, we show that there exists a continuous roof function such that the set of measures of maximal entropy for the suspension semi-flow over (X,f) consists precisely of the lifts of measures which maximize entropy on Y. This result has a number of implications for the possible size of the set of measures of maximal entropy for topological suspension flows. In particular, for a suspension flow on the full shift on a finite alphabet, the set of ergodic measures of maximal entropy may be countable, uncountable, or have any finite cardinality.
Submitted 27 January, 2021; v1 submitted 16 September, 2019;
originally announced September 2019.
-
Analyzing Input and Output Representations for Speech-Driven Gesture Generation
Authors:
Taras Kucherenko,
Dai Hasegawa,
Gustav Eje Henter,
Naoshi Kaneko,
Hedvig Kjellström
Abstract:
This paper presents a novel framework for automatic speech-driven gesture generation, applicable to human-agent interaction including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates. Our approach consists of two steps. First, we learn a lower-dimensional representation of human motion using a denoising autoencoder neural network, consisting of a motion encoder MotionE and a motion decoder MotionD. The learned representation preserves the most important aspects of the human pose variation while removing less relevant variation. Second, we train a novel encoder network SpeechE to map from speech to a corresponding motion representation with reduced dimensionality. At test time, the speech encoder and the motion decoder networks are combined: SpeechE predicts motion representations based on a given speech signal and MotionD then decodes these representations to produce motion sequences. We evaluate different representation sizes in order to find the most effective dimensionality for the representation. We also evaluate the effects of using different speech features as input to the model. We find that mel-frequency cepstral coefficients (MFCCs), alone or combined with prosodic features, perform the best. The results of a subsequent user study confirm the benefits of the representation learning.
Submitted 11 June, 2019; v1 submitted 8 March, 2019;
originally announced March 2019.
-
A Neural Network Approach to Missing Marker Reconstruction in Human Motion Capture
Authors:
Taras Kucherenko,
Jonas Beskow,
Hedvig Kjellström
Abstract:
Optical motion capture systems have become a widely used technology in various fields, such as augmented reality, robotics, movie production, etc. Such systems use a large number of cameras to triangulate the position of optical markers. The marker positions are estimated with high accuracy. However, especially when tracking articulated bodies, a fraction of the markers in each timestep is missing from the reconstruction. In this paper, we propose to use a neural network approach to learn how human motion is temporally and spatially correlated, and reconstruct missing marker positions through this model. We experiment with two different models, one LSTM-based and one time-window-based. Both methods produce state-of-the-art results, while working online, as opposed to most of the alternative methods, which require the complete sequence to be known. The implementation is publicly available at https://github.com/Svito-zar/NN-for-Missing-Marker-Reconstruction .
Submitted 25 September, 2018; v1 submitted 7 March, 2018;
originally announced March 2018.
-
Machine Learning and Social Robotics for Detecting Early Signs of Dementia
Authors:
Patrik Jonell,
Joseph Mendelson,
Thomas Storskog,
Goran Hagman,
Per Ostberg,
Iolanda Leite,
Taras Kucherenko,
Olga Mikheeva,
Ulrika Akenine,
Vesna Jelic,
Alina Solomon,
Jonas Beskow,
Joakim Gustafson,
Miia Kivipelto,
Hedvig Kjellstrom
Abstract:
This paper presents the EACare project, an ambitious multi-disciplinary collaboration with the aim to develop an embodied system, capable of carrying out neuropsychological tests to detect early signs of dementia, e.g., due to Alzheimer's disease. The system will use methods from Machine Learning and Social Robotics, and be trained with examples of recorded clinician-patient interactions. The interaction will be developed using a participatory design approach. We describe the scope and method of the project, and report on a first Wizard of Oz prototype.
Submitted 5 September, 2017;
originally announced September 2017.
-
Measures of maximal entropy for suspension flows over the full shift
Authors:
Tamara Kucherenko,
Daniel J. Thompson
Abstract:
We consider suspension flows with continuous roof function over the full shift $Σ$ on a finite alphabet. For any positive entropy subshift of finite type $Y \subset Σ$, we explicitly construct a roof function such that the measure(s) of maximal entropy for the suspension flow over $Σ$ are exactly the lifts of the measure(s) of maximal entropy for $Y$. In the case when $Y$ is transitive, this gives a unique measure of maximal entropy for the flow which is not fully supported. If $Y$ has more than one transitive component, all with the same entropy, this gives explicit examples of suspension flows over the full shift with multiple measures of maximal entropy. This contrasts with the case of a Hölder continuous roof function where it is well known that the measure of maximal entropy is unique and fully supported.
Submitted 12 March, 2019; v1 submitted 1 August, 2017;
originally announced August 2017.
-
Ground States and Zero-Temperature Measures at the Boundary of Rotation Sets
Authors:
Tamara Kucherenko,
Christian Wolf
Abstract:
We consider a continuous dynamical system $f:X\to X$ on a compact metric space $X$ equipped with an $m$-dimensional continuous potential $Φ=(φ_1,\cdots,φ_m):X\to \mathbb{R}^m$. We study the set of ground states $GS(α)$ of the potential $α\cdot Φ$ as a function of the direction vector $α\in S^{m-1}$. We show that the structure of the ground state sets is naturally related to the geometry of the generalized rotation set of $Φ$. In particular, for each $α$ the set of rotation vectors of $GS(α)$ forms a non-empty, compact and connected subset of a face $F_α(Φ)$ of the rotation set associated with $α$. Moreover, every ground state maximizes entropy among all invariant measures with rotation vectors in $F_α(Φ)$. We further establish the occurrence of several quite unexpected phenomena. Namely, we construct for any $m\in\mathbb{N}$ examples with an exposed boundary point (i.e. $F_α(Φ)$ being a singleton) without a unique ground state. Further, we establish the possibility of a line segment face $F_α(Φ)$ with a unique but non-ergodic ground state. Finally, we establish the possibility that the set of rotation vectors of $GS(α)$ is a non-trivial line segment.
Submitted 21 April, 2016;
originally announced April 2016.
-
Localized topological pressure and equilibrium states
Authors:
Tamara Kucherenko,
Christian Wolf
Abstract:
We introduce the notion of localized topological pressure for continuous maps on compact metric spaces. The localized pressure of a continuous potential $\varphi$ is computed by considering only those $(n,ε)$-separated sets whose statistical sums with respect to an $m$-dimensional potential $Φ$ are "close" to a given value $w\in \mathbb{R}^m$. We then establish for several classes of systems and potentials $\varphi$ and $Φ$ a local version of the variational principle. We also construct examples showing that the assumptions in the localized variational principle are fairly sharp. Next, we study localized equilibrium states and show that even in the case of subshifts of finite type and Hölder continuous potentials, there are several new phenomena that do not occur in the theory of classical equilibrium states. In particular, ergodic localized equilibrium states for Hölder continuous potentials are in general not unique.
Submitted 15 October, 2013;
originally announced October 2013.
-
Geometry and entropy of generalized rotation sets
Authors:
Tamara Kucherenko,
Christian Wolf
Abstract:
For a continuous map $f$ on a compact metric space we study the geometry and entropy of the generalized rotation set $\R(Φ)$. Here $Φ=(φ_1,...,φ_m)$ is an $m$-dimensional continuous potential and $\R(Φ)$ is the set of all $μ$-integrals of $Φ$, where $μ$ runs over all $f$-invariant probability measures. It is easy to see that the rotation set is a compact and convex subset of $\mathbb{R}^m$. We study the question of whether every compact and convex set is attained as a rotation set of a particular set of potentials within a particular class of dynamical systems. We give a positive answer in the case of subshifts of finite type by constructing for every compact and convex set $K$ in $\mathbb{R}^m$ a potential $Φ=Φ(K)$ with $\R(Φ)=K$. Next, we study the relation between $\R(Φ)$ and the set of all statistical limits $\R_{Pt}(Φ)$. We show that in general these sets differ but also provide criteria that guarantee $\R(Φ)= \R_{Pt}(Φ)$. Finally, we study the entropy function $w\mapsto H(w), w\in \R(Φ)$. We establish a variational principle for the entropy function and show that for certain non-uniformly hyperbolic systems $H(w)$ is determined by the growth rate of those hyperbolic periodic orbits whose $Φ$-integrals are close to $w$. We also show that for systems with strong thermodynamic properties (subshifts of finite type, hyperbolic systems and expansive homeomorphisms with specification, etc.) the entropy function $w\mapsto H(w)$ is real-analytic in the interior of the rotation set.
Submitted 29 September, 2012;
originally announced October 2012.