Search | arXiv e-print repository

Realization of Anosov Diffeomorphisms on the Torus

Authors: Tamara Kucherenko, Anthony Quas

Abstract: We study area preserving Anosov maps on the two-dimensional torus within a fixed homotopy class. We show that the set of pressure functions for Anosov diffeomorphisms with respect to the geometric potential is equal to the set of pressure functions for the linear Anosov automorphism with respect to Hölder potentials. We use this result to provide a negative answer to the $C^{1+α}$ version of the question posed by Rodriguez Hertz on whether two homotopic area preserving $C^\infty$ Anosov diffeomorphisms whose geometric potentials have identical pressure functions must be $C^\infty$ conjugate.

Submitted 3 July, 2024; originally announced July 2024.

MSC Class: 37D35; 37B10; 37A60; 37C15; 37D20

arXiv:2310.18855 [pdf, ps, other]

Ergodic theory on coded shift spaces

Authors: Tamara Kucherenko, Martin Schmoll, Christian Wolf

Abstract: We study ergodic-theoretic properties of coded shift spaces. A coded shift space is defined as a closure of all bi-infinite concatenations of words from a fixed countable generating set. We derive sufficient conditions for the uniqueness of measures of maximal entropy and equilibrium states of Hölder continuous potentials based on the partition of the coded shift into its concatenation set (sequences that are concatenations of generating words) and its residual set (sequences added under the closure). In this case we provide a simple explicit description of the measure of maximal entropy. We also obtain flexibility results for the entropy on the concatenation and residual sets. Finally, we prove a local structure theorem for intrinsically ergodic coded shift spaces which shows that our results apply to a larger class of coded shift spaces compared to previous works by Climenhaga, Climenhaga and Thompson, and Pavlov.

Submitted 9 July, 2024; v1 submitted 28 October, 2023; originally announced October 2023.

Comments: 42 pages

MSC Class: 37A35; 37B10; 37B40; 37D35

arXiv:2308.12646 [pdf, other]

The GENEA Challenge 2023: A large scale evaluation of gesture generation models in monadic and dyadic settings

Authors: Taras Kucherenko, Rajmund Nagy, Youngwoo Yoon, Jieyeon Woo, Teodor Nikolov, Mihail Tsakov, Gustav Eje Henter

Abstract: This paper reports on the GENEA Challenge 2023, in which participating teams built speech-driven gesture-generation systems using the same speech and motion dataset, followed by a joint evaluation. This year's challenge provided data on both sides of a dyadic interaction, allowing teams to generate full-body motion for an agent given its speech (text and audio) and the speech and motion of the interlocutor. We evaluated 12 submissions and 2 baselines together with held-out motion-capture data in several large-scale user studies. The studies focused on three aspects: 1) the human-likeness of the motion, 2) the appropriateness of the motion for the agent's own speech whilst controlling for the human-likeness of the motion, and 3) the appropriateness of the motion for the behaviour of the interlocutor in the interaction, using a setup that controls for both the human-likeness of the motion and the agent's own speech. We found a large span in human-likeness between challenge submissions, with a few systems rated close to human mocap. Appropriateness seems far from being solved, with most submissions performing in a narrow range slightly above chance, far behind natural motion. The effect of the interlocutor is even more subtle, with submitted systems at best performing barely above chance. Interestingly, a dyadic system being highly appropriate for agent speech does not necessarily imply high appropriateness for the interlocutor. Additional material is available via the project website at https://svito-zar.github.io/GENEAchallenge2023/ .

Submitted 24 August, 2023; originally announced August 2023.

Comments: The first three authors made equal contributions. Accepted for publication at the ACM International Conference on Multimodal Interaction (ICMI)

ACM Class: I.3; I.2

arXiv:2303.08737 [pdf, other]

doi 10.1145/3656374

Evaluating gesture generation in a large-scale open challenge: The GENEA Challenge 2022

Authors: Taras Kucherenko, Pieter Wolfert, Youngwoo Yoon, Carla Viegas, Teodor Nikolov, Mihail Tsakov, Gustav Eje Henter

Abstract: This paper reports on the second GENEA Challenge to benchmark data-driven automatic co-speech gesture generation. Participating teams used the same speech and motion dataset to build gesture-generation systems. Motion generated by all these systems was rendered to video using a standardised visualisation pipeline and evaluated in several large, crowdsourced user studies. Unlike when comparing different research papers, differences in results are here only due to differences between methods, enabling direct comparison between systems. The dataset was based on 18 hours of full-body motion capture, including fingers, of different persons engaging in a dyadic conversation. Ten teams participated in the challenge across two tiers: full-body and upper-body gesticulation. For each tier, we evaluated both the human-likeness of the gesture motion and its appropriateness for the specific speech signal. Our evaluations decouple human-likeness from gesture appropriateness, which has been a difficult problem in the field. The evaluation results show some synthetic gesture conditions being rated as significantly more human-like than 3D human motion capture. To the best of our knowledge, this has not been demonstrated before. On the other hand, all synthetic motion is found to be vastly less appropriate for the speech than the original motion-capture recordings. We also find that conventional objective metrics do not correlate well with subjective human-likeness ratings in this large evaluation. The one exception is the Fréchet gesture distance (FGD), which achieves a Kendall's tau rank correlation of around $-0.5$. Based on the challenge results we formulate numerous recommendations for system building and evaluation.

Submitted 28 March, 2024; v1 submitted 15 March, 2023; originally announced March 2023.

Comments: The first three authors made equal contributions and share joint first authorship. Accepted for publication in the ACM Transactions on Graphics (TOG). Please see https://youngwoo-yoon.github.io/GENEAchallenge2022/ for all challenge materials. arXiv admin note: text overlap with arXiv:2208.10441

ACM Class: I.3; I.2

arXiv:2302.14839 [pdf, ps, other]

Asymptotic behavior of the pressure function for Hölder potentials

Authors: Tamara Kucherenko, Anthony Quas

Abstract: We study the behavior of the pressure function for Hölder continuous potentials on mixing subshifts of finite type. The classical theory of thermodynamic formalism shows that such pressure functions are convex, analytic and have slant asymptotes. We provide a sharp exponential lower bound on how fast the pressure function approaches its asymptotes. As a counterpart, we also show that there is no corresponding upper bound by exhibiting systems for which the convergence is arbitrarily slow. However, we prove that the exponential upper bound still holds for a generic Hölder potential. In addition, we determine that the pressure function satisfies a coarse uniform convexity property. Asymptotic bounds and quantitative convexity estimates are the first additional general properties of the pressure function obtained in the settings of Bowen and Ruelle since their groundbreaking work more than 40 years ago.

Submitted 28 February, 2023; originally announced February 2023.

MSC Class: 37D35; 37B10; 37A60

arXiv:2301.05339 [pdf, other]

doi 10.1111/cgf.14776

A Comprehensive Review of Data-Driven Co-Speech Gesture Generation

Authors: Simbarashe Nyatsanga, Taras Kucherenko, Chaitanya Ahuja, Gustav Eje Henter, Michael Neff

Abstract: Gestures that accompany speech are an essential part of natural and efficient embodied human communication. The automatic generation of such co-speech gestures is a long-standing problem in computer animation and is considered an enabling technology in film, games, virtual social spaces, and for interaction with social robots. The problem is made challenging by the idiosyncratic and non-periodic nature of human co-speech gesture motion, and by the great diversity of communicative functions that gestures encompass. Gesture generation has seen surging interest recently, owing to the emergence of more and larger datasets of human gesture motion, combined with strides in deep-learning-based generative models that benefit from the growing availability of data. This review article summarizes co-speech gesture generation research, with a particular focus on deep generative models. First, we articulate the theory describing human gesticulation and how it complements speech. Next, we briefly discuss rule-based and classical statistical gesture synthesis, before delving into deep learning approaches. We employ the choice of input modalities as an organizing principle, examining systems that generate gestures from audio, text, and non-linguistic input. We also chronicle the evolution of the related training data sets in terms of size, diversity, motion quality, and collection method. Finally, we identify key research challenges in gesture generation, including data availability and quality; producing human-like motion; grounding the gesture in the co-occurring speech, in interaction with other speakers, and in the environment; performing gesture evaluation; and integration of gesture synthesis into applications. We highlight recent approaches to tackling the various key challenges, as well as the limitations of these approaches, and point toward areas of future development.

Submitted 10 April, 2023; v1 submitted 12 January, 2023; originally announced January 2023.

Comments: Accepted for EUROGRAPHICS 2023

ACM Class: I.3.7

arXiv:2210.06974 [pdf, other]

doi 10.1145/3514197.3549697

Evaluating Data-Driven Co-Speech Gestures of Embodied Conversational Agents through Real-Time Interaction

Authors: Yuan He, André Pereira, Taras Kucherenko

Abstract: Embodied Conversational Agents that make use of co-speech gestures can enhance human-machine interactions in many ways. In recent years, data-driven gesture generation approaches for ECAs have attracted considerable research attention, and related methods have continuously improved. Real-time interaction is typically used when researchers evaluate ECA systems that generate rule-based gestures. However, when evaluating the performance of ECAs based on data-driven methods, participants are often required only to watch pre-recorded videos, which cannot provide adequate information about what a person perceives during the interaction. To address this limitation, we explored the use of real-time interaction to assess data-driven gesturing ECAs. We provided a testbed framework, and investigated whether gestures could affect human perception of ECAs in the dimensions of human-likeness, animacy, perceived intelligence, and focused attention. Our user study required participants to interact with two ECAs, one with and one without hand gestures. We collected subjective data from the participants' self-report questionnaires and objective data from a gaze tracker. To our knowledge, the current study represents the first attempt to evaluate data-driven gesturing ECAs through real-time interaction and the first experiment using gaze-tracking to examine the effect of ECAs' gestures.

Submitted 13 October, 2022; originally announced October 2022.

Comments: Published at the International Conference on Intelligent Virtual Agents

arXiv:2208.10441 [pdf, other]

doi 10.1145/3536221.3558058

The GENEA Challenge 2022: A large evaluation of data-driven co-speech gesture generation

Authors: Youngwoo Yoon, Pieter Wolfert, Taras Kucherenko, Carla Viegas, Teodor Nikolov, Mihail Tsakov, Gustav Eje Henter

Abstract: This paper reports on the second GENEA Challenge to benchmark data-driven automatic co-speech gesture generation. Participating teams used the same speech and motion dataset to build gesture-generation systems. Motion generated by all these systems was rendered to video using a standardised visualisation pipeline and evaluated in several large, crowdsourced user studies. Unlike when comparing different research papers, differences in results are here only due to differences between methods, enabling direct comparison between systems. This year's dataset was based on 18 hours of full-body motion capture, including fingers, of different persons engaging in dyadic conversation. Ten teams participated in the challenge across two tiers: full-body and upper-body gesticulation. For each tier we evaluated both the human-likeness of the gesture motion and its appropriateness for the specific speech signal. Our evaluations decouple human-likeness from gesture appropriateness, which previously was a major challenge in the field. The evaluation results are a revolution, and a revelation. Some synthetic conditions are rated as significantly more human-like than human motion capture. To the best of our knowledge, this has never been shown before on a high-fidelity avatar. On the other hand, all synthetic motion is found to be vastly less appropriate for the speech than the original motion-capture recordings. Additional material is available via the project website at https://youngwoo-yoon.github.io/GENEAchallenge2022/

Submitted 22 August, 2022; originally announced August 2022.

Comments: 12 pages, 5 figures; final version for ACM ICMI 2022

ACM Class: I.3; I.2

arXiv:2108.05762 [pdf, other]

Multimodal analysis of the predictability of hand-gesture properties

Authors: Taras Kucherenko, Rajmund Nagy, Michael Neff, Hedvig Kjellström, Gustav Eje Henter

Abstract: Embodied conversational agents benefit from being able to accompany their speech with gestures. Although many data-driven approaches to gesture generation have been proposed in recent years, it is still unclear whether such systems can consistently generate gestures that convey meaning. We investigate which gesture properties (phase, category, and semantics) can be predicted from speech text and/or audio using contemporary deep learning. In extensive experiments, we show that gesture properties related to gesture meaning (semantics and category) are predictable from text features (time-aligned FastText embeddings) alone, but not from prosodic audio features, while rhythm-related gesture properties (phase) on the other hand can be predicted from audio features better than from text. These results are encouraging as they indicate that it is possible to equip an embodied agent with content-wise meaningful co-speech gestures using a machine-learning model.

Submitted 14 January, 2022; v1 submitted 12 August, 2021; originally announced August 2021.

Comments: Accepted at the International Conference on Autonomous Agents and Multiagent Systems (AAMAS) 2022

arXiv:2108.05709 [pdf, other]

To Rate or Not To Rate: Investigating Evaluation Methods for Generated Co-Speech Gestures

Authors: Pieter Wolfert, Jeffrey M. Girard, Taras Kucherenko, Tony Belpaeme

Abstract: While automatic performance metrics are crucial for machine learning of artificial human-like behaviour, the gold standard for evaluation remains human judgement. The subjective evaluation of artificial human-like behaviour in embodied conversational agents is however expensive and little is known about the quality of the data it returns. Two approaches to subjective evaluation can be largely distinguished, one relying on ratings, the other on pairwise comparisons. In this study we use co-speech gestures to compare the two against each other and answer questions about their appropriateness for evaluation of artificial behaviour. We consider their ability to rate quality, but also aspects pertaining to the effort of use and the time required to collect subjective data. We use crowd sourcing to rate the quality of co-speech gestures in avatars, assessing which method picks up more detail in subjective assessments. We compared gestures generated by three different machine learning models with various levels of behavioural quality. We found that both approaches were able to rank the videos according to quality and that the rankings were significantly correlated, showing that in terms of quality there is no preference of one method over the other. We also found that pairwise comparisons were slightly faster and came with improved inter-rater reliability, suggesting that for small-scale studies pairwise comparisons are to be favoured over ratings.

Submitted 13 August, 2021; v1 submitted 12 August, 2021; originally announced August 2021.

Comments: Accepted for publication at the International Conference on Multimodal Interaction (ICMI'21)

arXiv:2108.00451 [pdf, ps, other]

doi 10.1007/s00220-022-04466-y

Flexibility of the Pressure Function

Authors: Tamara Kucherenko, Anthony Quas

Abstract: We study the flexibility of the pressure function of a continuous potential (observable) with respect to a parameter regarded as the inverse temperature. The points of non-differentiability of this function are of particular interest in statistical physics, since they correspond to phase transitions. It is well known that the pressure function is convex, Lipschitz, and has an asymptote at infinity. We prove that in a setting of one-dimensional compact symbolic systems these are the only restrictions. We present a method to explicitly construct a continuous potential whose pressure function coincides with any prescribed convex Lipschitz asymptotically linear function starting at a given positive value of the parameter. In fact, we establish a multidimensional version of this result. As a consequence, we obtain that for a continuous observable the phase transitions can occur at a countable dense set of temperature values. We go further and show that one can vary the cardinality of the set of ergodic equilibrium states as a function of the parameter to be any number, finite or infinite.

Submitted 28 February, 2023; v1 submitted 1 August, 2021; originally announced August 2021.

MSC Class: 37A60; 37B10; 37D35

arXiv:2106.14736 [pdf, other]

doi 10.1145/3472306.3478333

Speech2Properties2Gestures: Gesture-Property Prediction as a Tool for Generating Representational Gestures from Speech

Authors: Taras Kucherenko, Rajmund Nagy, Patrik Jonell, Michael Neff, Hedvig Kjellström, Gustav Eje Henter

Abstract: We propose a new framework for gesture generation, aiming to allow data-driven approaches to produce more semantically rich gestures. Our approach first predicts whether to gesture, followed by a prediction of the gesture properties. Those properties are then used as conditioning for a modern probabilistic gesture-generation model capable of high-quality output. This empowers the approach to generate gestures that are both diverse and representational. Follow-ups and more information can be found on the project page: https://svito-zar.github.io/speech2properties2gestures/ .

Submitted 13 August, 2021; v1 submitted 28 June, 2021; originally announced June 2021.

Comments: Accepted for publication at the ACM International Conference on Intelligent Virtual Agents (IVA 2021)

ACM Class: I.2.7; I.2.6; I.3.7

Journal ref: International Conference on Intelligent Virtual Agents 2021

arXiv:2102.12302 [pdf, other]

A Framework for Integrating Gesture Generation Models into Interactive Conversational Agents

Authors: Rajmund Nagy, Taras Kucherenko, Birger Moell, André Pereira, Hedvig Kjellström, Ulysses Bernardet

Abstract: Embodied conversational agents (ECAs) benefit from non-verbal behavior for natural and efficient interaction with users. Gesticulation - hand and arm movements accompanying speech - is an essential part of non-verbal behavior. Gesture generation models have been developed for several decades: starting with rule-based and ending with mainly data-driven methods. To date, recent end-to-end gesture generation methods have not been evaluated in a real-time interaction with users. We present a proof-of-concept framework, which is intended to facilitate evaluation of modern gesture generation models in interaction. We demonstrate an extensible open-source framework that contains three components: 1) a 3D interactive agent; 2) a chatbot backend; 3) a gesticulating system. Each component can be replaced, making the proposed framework applicable for investigating the effect of different gesturing models in real-time interactions with different communication modalities, chatbot backends, or different agent appearances. The code and video are available at the project page https://nagyrajmund.github.io/project/gesturebot.

Submitted 24 February, 2021; originally announced February 2021.

Comments: Rajmund Nagy and Taras Kucherenko contributed equally to this work. To be published in the Proceedings of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2021), Online, May 3-7, 2021, IFAA-MAS, 3 pages, 1 figure

arXiv:2102.11617 [pdf, other]

doi 10.1145/3397481.3450692

A large, crowdsourced evaluation of gesture generation systems on common data: The GENEA Challenge 2020

Authors: Taras Kucherenko, Patrik Jonell, Youngwoo Yoon, Pieter Wolfert, Gustav Eje Henter

Abstract: Co-speech gestures, gestures that accompany speech, play an important role in human communication. Automatic co-speech gesture generation is thus a key enabling technology for embodied conversational agents (ECAs), since humans expect ECAs to be capable of multi-modal communication. Research into gesture generation is rapidly gravitating towards data-driven methods. Unfortunately, individual research efforts in the field are difficult to compare: there are no established benchmarks, and each study tends to use its own dataset, motion visualisation, and evaluation methodology. To address this situation, we launched the GENEA Challenge, a gesture-generation challenge wherein participating teams built automatic gesture-generation systems on a common dataset, and the resulting systems were evaluated in parallel in a large, crowdsourced user study using the same motion-rendering pipeline. Since differences in evaluation outcomes between systems now are solely attributable to differences between the motion-generation methods, this enables benchmarking recent approaches against one another in order to get a better impression of the state of the art in the field. This paper reports on the purpose, design, results, and implications of our challenge.

Submitted 23 February, 2021; originally announced February 2021.

Comments: Accepted for publication at the 26th International Conference on Intelligent User Interfaces (IUI'21). 11 pages, 5 figures

ACM Class: I.3; I.2

arXiv:2101.11898 [pdf, other]

doi 10.1145/3462244.3479957

HEMVIP: Human Evaluation of Multiple Videos in Parallel

Authors: Patrik Jonell, Youngwoo Yoon, Pieter Wolfert, Taras Kucherenko, Gustav Eje Henter

Abstract: In many research areas, for example motion and gesture generation, objective measures alone do not provide an accurate impression of key stimulus traits such as perceived quality or appropriateness. The gold standard is instead to evaluate these aspects through user studies, especially subjective evaluations of video stimuli. Common evaluation paradigms either present individual stimuli to be scored on Likert-type scales, or ask users to compare and rate videos in a pairwise fashion. However, the time and resources required for such evaluations scale poorly as the number of conditions to be compared increases. Building on standards used for evaluating the quality of multimedia codecs, this paper instead introduces a framework for granular rating of multiple comparable videos in parallel. This methodology essentially analyses all condition pairs at once. Our contributions are 1) a proposed framework, called HEMVIP, for parallel and granular evaluation of multiple video stimuli and 2) a validation study confirming that results obtained using the tool are in close agreement with results of prior studies using conventional multiple pairwise comparisons.

Submitted 20 October, 2021; v1 submitted 28 January, 2021; originally announced January 2021.

Comments: 6 pages, 1 figure. Proceedings of the 22nd ACM International Conference on Multimodal Interaction. 2021. Montreal, Canada

arXiv:2101.05684 [pdf, other]

doi 10.1145/3383652.3423874

Generating coherent spontaneous speech and gesture from text

Authors: Simon Alexanderson, Éva Székely, Gustav Eje Henter, Taras Kucherenko, Jonas Beskow

Abstract: Embodied human communication encompasses both verbal (speech) and non-verbal information (e.g., gesture and head movements). Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both of these types of data: On the speech side, text-to-speech systems are now able to generate highly convincing, spontaneous-sounding speech using unscripted speech audio as the source material. On the motion side, probabilistic motion-generation methods can now synthesise vivid and lifelike speech-driven 3D gesticulation. In this paper, we put these two state-of-the-art technologies together in a coherent fashion for the first time. Concretely, we demonstrate a proof-of-concept system trained on a single-speaker audio and motion-capture dataset, that is able to generate both speech and full-body gestures together from text input. In contrast to previous approaches for joint speech-and-gesture generation, we generate full-body gestures from speech synthesis trained on recordings of spontaneous speech from the same person as the motion-capture data. We illustrate our results by visualising gesture spaces and text-speech-gesture alignments, and through a demonstration video at https://simonalexanderson.github.io/IVA2020 .

Submitted 14 January, 2021; originally announced January 2021.

Comments: 3 pages, 2 figures, published at the ACM International Conference on Intelligent Virtual Agents (IVA) 2020

MSC Class: 68T07

ACM Class: I.2.6; J.4; I.3.7; I.2.9

Journal ref: Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents (IVA '20), 2020, 3 pages

arXiv:2009.10760 [pdf, other]

doi 10.1145/3383652.3423860

Can we trust online crowdworkers? Comparing online and offline participants in a preference test of virtual agents

Authors: Patrik Jonell, Taras Kucherenko, Ilaria Torre, Jonas Beskow

Abstract: Conducting user studies is a crucial component in many scientific fields. While some studies require participants to be physically present, other studies can be conducted both physically (e.g. in-lab) and online (e.g. via crowdsourcing). Inviting participants to the lab can be a time-consuming and logistically difficult endeavor, not to mention that sometimes research groups might not be able to run in-lab experiments, because of, for example, a pandemic. Crowdsourcing platforms such as Amazon Mechanical Turk (AMT) or Prolific can therefore be a suitable alternative to run certain experiments, such as evaluating virtual agents. Although previous studies investigated the use of crowdsourcing platforms for running experiments, there is still uncertainty as to whether the results are reliable for perceptual studies. Here we replicate a previous experiment where participants evaluated a gesture generation model for virtual agents. The experiment is conducted across three participant pools -- in-lab, Prolific, and AMT -- having similar demographics across the in-lab participants and the Prolific platform. Our results show no difference between the three participant pools in regards to their evaluations of the gesture generation models and their reliability scores. The results indicate that online platforms can successfully be used for perceptual evaluations of this kind.

Submitted 23 October, 2020; v1 submitted 22 September, 2020; originally announced September 2020.

Comments: Patrik Jonell and Taras Kucherenko contributed equally to this work. Published in the Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents. 8 pages, 7 figures

arXiv:2007.09170 [pdf, other]

doi 10.1080/10447318.2021.1883883

Moving fast and slow: Analysis of representations and post-processing in speech-driven automatic gesture generation

Authors: Taras Kucherenko, Dai Hasegawa, Naoshi Kaneko, Gustav Eje Henter, Hedvig Kjellström

Abstract: This paper presents a novel framework for speech-driven gesture production, applicable to virtual agents to enhance human-computer interaction. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates. We provide an analysis of different representations for the input (speech) and the output (motion) of the network by both objective and subjective evaluations. We also analyse the importance of smoothing of the produced motion. Our results indicated that the proposed method improved on our baseline in terms of objective measures. For example, it better captured the motion dynamics and better matched the motion-speed distribution. Moreover, we performed user studies on two different datasets. The studies confirmed that our proposed method is perceived as more natural than the baseline, although the difference in the studies was eliminated by appropriate post-processing: hip-centering and smoothing. We conclude that it is important to take both motion representation and post-processing into account when designing an automatic gesture-production method.

Submitted 28 January, 2021; v1 submitted 16 July, 2020; originally announced July 2020.

Comments: Extension of our IVA'19 paper. Accepted at the International Journal of Human-Computer Interaction. See more at https://svito-zar.github.io/audio2gestures/. arXiv admin note: substantial text overlap with arXiv:1903.03369

ACM Class: I.2.7; I.2.6; I.3.7

Journal ref: Int. J. Hum. Comput. Interact. (2021)

arXiv:2006.13988 [pdf, ps, other]

Multiple phase transitions on compact symbolic systems

Authors: Tamara Kucherenko, Anthony Quas, Christian Wolf

Abstract: Let $φ:X\to \mathbb R$ be a continuous potential associated with a symbolic dynamical system $T:X\to X$ over a finite alphabet. Introducing a parameter $β>0$ (interpreted as the inverse temperature) we study the regularity of the pressure function $β\mapsto P_{\rm top}(βφ)$ on an interval $[α,\infty)$ with $α>0$. We say that $φ$ has a phase transition at $β_0$ if the pressure function $P_{\rm top}(βφ)$ is not differentiable at $β_0$. This is equivalent to the condition that the potential $β_0φ$ has two (ergodic) equilibrium states with distinct entropies. For any $α>0$ and any increasing sequence of real numbers $(β_n)$ contained in $[α,\infty)$, we construct a potential $φ$ whose phase transitions in $[α,\infty)$ occur precisely at the $β_n$'s. In particular, we obtain a potential which has a countably infinite set of phase transitions.

Submitted 5 September, 2020; v1 submitted 24 June, 2020; originally announced June 2020.

Comments: In this update we present a revised version of the main theorem which now only deals with phase transitions in an interval $[α,\infty)$ for some fixed $α>0.$

MSC Class: 37A60; 37B10; 37D35

arXiv:2006.09888 [pdf, other]

doi 10.1145/3383652.3423911

Let's Face It: Probabilistic Multi-modal Interlocutor-aware Generation of Facial Gestures in Dyadic Settings

Authors: Patrik Jonell, Taras Kucherenko, Gustav Eje Henter, Jonas Beskow

Abstract: To enable more natural face-to-face interactions, conversational agents need to adapt their behavior to their interlocutors. One key aspect of this is generation of appropriate non-verbal behavior for the agent, for example facial gestures, here defined as facial expressions and head movements. Most existing gesture-generating systems do not utilize multi-modal cues from the interlocutor when synthesizing non-verbal behavior. Those that do, typically use deterministic methods that risk producing repetitive and non-vivid motions. In this paper, we introduce a probabilistic method to synthesize interlocutor-aware facial gestures - represented by highly expressive FLAME parameters - in dyadic conversations. Our contributions are: a) a method for feature extraction from multi-party video and speech recordings, resulting in a representation that allows for independent control and manipulation of expression and speech articulation in a 3D avatar; b) an extension to MoGlow, a recent motion-synthesis method based on normalizing flows, to also take multi-modal signals from the interlocutor as input and subsequently output interlocutor-aware facial gestures; and c) a subjective evaluation assessing the use and relative importance of the input modalities. The results show that the model successfully leverages the input from the interlocutor to generate more appropriate behavior. Videos, data, and code available at: https://jonepatr.github.io/lets_face_it.

Submitted 22 October, 2020; v1 submitted 11 June, 2020; originally announced June 2020.

Comments: Best Paper Award. 8 pages, 4 figures, IVA '20: Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents

arXiv:2001.09326 [pdf, other]

doi 10.1145/3382507.3418815

Gesticulator: A framework for semantically-aware speech-driven gesture generation

Authors: Taras Kucherenko, Patrik Jonell, Sanne van Waveren, Gustav Eje Henter, Simon Alexanderson, Iolanda Leite, Hedvig Kjellström

Abstract: During speech, people spontaneously gesticulate, which plays a key role in conveying information. Similarly, realistic co-speech gestures are crucial to enable natural and smooth interactions with social agents. Current end-to-end co-speech gesture generation systems use a single modality for representing speech: either audio or text. These systems are therefore confined to producing either acoustically-linked beat gestures or semantically-linked gesticulation (e.g., raising a hand when saying "high"): they cannot appropriately learn to generate both gesture types. We present a model designed to produce arbitrary beat and semantic gestures together. Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output. The resulting gestures can be applied to both virtual agents and humanoid robots. Subjective and objective evaluations confirm the success of our approach. The code and video are available at the project page https://svito-zar.github.io/gesticulator .

Submitted 14 January, 2021; v1 submitted 25 January, 2020; originally announced January 2020.

Comments: ICMI 2020 Best Paper Award. Code is available. 9 pages, 6 figures

ACM Class: I.2.7; I.2.6; I.3.7

Journal ref: Proceedings of the 2020 International Conference on Multimodal Interaction (ICMI '20)

arXiv:1909.07317 [pdf, ps, other]

Measures of maximal entropy on subsystems of topological suspension semi-flows

Authors: Tamara Kucherenko, Daniel J. Thompson

Abstract: Given a compact topological dynamical system (X, f) with positive entropy and upper semi-continuous entropy map, and any closed invariant subset $Y \subset X$ with positive entropy, we show that there exists a continuous roof function such that the set of measures of maximal entropy for the suspension semi-flow over (X,f) consists precisely of the lifts of measures which maximize entropy on Y. This result has a number of implications for the possible size of the set of measures of maximal entropy for topological suspension flows. In particular, for a suspension flow on the full shift on a finite alphabet, the set of ergodic measures of maximal entropy may be countable, uncountable, or have any finite cardinality.

Submitted 27 January, 2021; v1 submitted 16 September, 2019; originally announced September 2019.

Comments: v3: 10 pages. Corrected some typos. To appear in Studia Mathematica

arXiv:1903.03369 [pdf, other]

doi 10.1145/3308532.3329472

Analyzing Input and Output Representations for Speech-Driven Gesture Generation

Authors: Taras Kucherenko, Dai Hasegawa, Gustav Eje Henter, Naoshi Kaneko, Hedvig Kjellström

Abstract: This paper presents a novel framework for automatic speech-driven gesture generation, applicable to human-agent interaction including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates. Our approach consists of two steps. First, we learn a lower-dimensional representation of human motion using a denoising autoencoder neural network, consisting of a motion encoder MotionE and a motion decoder MotionD. The learned representation preserves the most important aspects of the human pose variation while removing less relevant variation. Second, we train a novel encoder network SpeechE to map from speech to a corresponding motion representation with reduced dimensionality. At test time, the speech encoder and the motion decoder networks are combined: SpeechE predicts motion representations based on a given speech signal and MotionD then decodes these representations to produce motion sequences. We evaluate different representation sizes in order to find the most effective dimensionality for the representation. We also evaluate the effects of using different speech features as input to the model. We find that mel-frequency cepstral coefficients (MFCCs), alone or combined with prosodic features, perform the best. The results of a subsequent user study confirm the benefits of the representation learning.

Submitted 11 June, 2019; v1 submitted 8 March, 2019; originally announced March 2019.

Comments: Accepted at IVA '19. Shorter version published at AAMAS '19. The code is available at https://github.com/GestureGeneration/Speech_driven_gesture_generation_with_autoencoder

ACM Class: I.2.6; I.5.1; J.4

arXiv:1803.02665 [pdf, other]

A Neural Network Approach to Missing Marker Reconstruction in Human Motion Capture

Authors: Taras Kucherenko, Jonas Beskow, Hedvig Kjellström

Abstract: Optical motion capture systems have become a widely used technology in various fields, such as augmented reality, robotics, movie production, etc. Such systems use a large number of cameras to triangulate the position of optical markers. The marker positions are estimated with high accuracy. However, especially when tracking articulated bodies, a fraction of the markers in each timestep is missing from the reconstruction. In this paper, we propose to use a neural network approach to learn how human motion is temporally and spatially correlated, and reconstruct missing marker positions through this model. We experiment with two different models, one LSTM-based and one time-window-based. Both methods produce state-of-the-art results, while working online, as opposed to most of the alternative methods, which require the complete sequence to be known. The implementation is publicly available at https://github.com/Svito-zar/NN-for-Missing-Marker-Reconstruction .

Submitted 25 September, 2018; v1 submitted 7 March, 2018; originally announced March 2018.

Comments: 7 pages, 6 figures

MSC Class: 68T05

arXiv:1709.01613 [pdf, other]

Machine Learning and Social Robotics for Detecting Early Signs of Dementia

Authors: Patrik Jonell, Joseph Mendelson, Thomas Storskog, Goran Hagman, Per Ostberg, Iolanda Leite, Taras Kucherenko, Olga Mikheeva, Ulrika Akenine, Vesna Jelic, Alina Solomon, Jonas Beskow, Joakim Gustafson, Miia Kivipelto, Hedvig Kjellstrom

Abstract: This paper presents the EACare project, an ambitious multi-disciplinary collaboration with the aim to develop an embodied system, capable of carrying out neuropsychological tests to detect early signs of dementia, e.g., due to Alzheimer's disease. The system will use methods from Machine Learning and Social Robotics, and be trained with examples of recorded clinician-patient interactions. The interaction will be developed using a participatory design approach. We describe the scope and method of the project, and report on a first Wizard of Oz prototype.

Submitted 5 September, 2017; originally announced September 2017.

arXiv:1708.00550 [pdf, ps, other]

Measures of maximal entropy for suspension flows over the full shift

Authors: Tamara Kucherenko, Daniel J. Thompson

Abstract: We consider suspension flows with continuous roof function over the full shift $Σ$ on a finite alphabet. For any positive entropy subshift of finite type $Y \subset Σ$, we explicitly construct a roof function such that the measure(s) of maximal entropy for the suspension flow over $Σ$ are exactly the lifts of the measure(s) of maximal entropy for $Y$. In the case when $Y$ is transitive, this gives a unique measure of maximal entropy for the flow which is not fully supported. If $Y$ has more than one transitive component, all with the same entropy, this gives explicit examples of suspension flows over the full shift with multiple measures of maximal entropy. This contrasts with the case of a Hölder continuous roof function where it is well known the measure of maximal entropy is unique and fully supported.

Submitted 12 March, 2019; v1 submitted 1 August, 2017; originally announced August 2017.

Comments: 13 pages, v3: minor revisions. To appear in Mathematische Zeitschrift

MSC Class: 37D35; 37B10; 37A35

arXiv:1604.06512 [pdf, ps, other]

Ground States and Zero-Temperature Measures at the Boundary of Rotation Sets

Authors: Tamara Kucherenko, Christian Wolf

Abstract: We consider a continuous dynamical system $f:X\to X$ on a compact metric space $X$ equipped with an $m$-dimensional continuous potential $Φ=(φ_1,\cdots,φ_m):X\to \mathbb{R}^m$. We study the set of ground states $GS(α)$ of the potential $α\cdot Φ$ as a function of the direction vector $α\in S^{m-1}$. We show that the structure of the ground state sets is naturally related to the geometry of the generalized rotation set of $Φ$. In particular, for each $α$ the set of rotation vectors of $GS(α)$ forms a non-empty, compact and connected subset of a face $F_α(Φ)$ of the rotation set associated with $α$. Moreover, every ground state maximizes entropy among all invariant measures with rotation vectors in $F_α(Φ)$. We further establish the occurrence of several quite unexpected phenomena. Namely, we construct for any $m\in\mathbb{N}$ examples with an exposed boundary point (i.e. $F_α(Φ)$ being a singleton) without a unique ground state. Further, we establish the possibility of a line segment face $F_α(Φ)$ with a unique but non-ergodic ground state. Finally, we establish the possibility that the set of rotation vectors of $GS(α)$ is a non-trivial line segment.

Submitted 21 April, 2016; originally announced April 2016.

Comments: 26 pages

MSC Class: 37D35; 37E45 (Primary); 37B10; 37E45; 37L40 (Secondary)

arXiv:1310.4030 [pdf, ps, other]

Localized topological pressure and equilibrium states

Authors: Tamara Kucherenko, Christian Wolf

Abstract: We introduce the notion of localized topological pressure for continuous maps on compact metric spaces. The localized pressure of a continuous potential $\varphi$ is computed by considering only those $(n,ε)$-separated sets whose statistical sums with respect to an $m$-dimensional potential $Φ$ are "close" to a given value $w\in \mathbb{R}^m$. We then establish for several classes of systems and potentials $\varphi$ and $Φ$ a local version of the variational principle. We also construct examples showing that the assumptions in the localized variational principle are fairly sharp. Next, we study localized equilibrium states and show that even in the case of subshifts of finite type and Hölder continuous potentials, there are several new phenomena that do not occur in the theory of classical equilibrium states. In particular, ergodic localized equilibrium states for Hölder continuous potentials are in general not unique.

Submitted 15 October, 2013; originally announced October 2013.

MSC Class: 37C40; 37D35; 37A60

arXiv:1210.0135 [pdf, ps, other]

Geometry and entropy of generalized rotation sets

Authors: Tamara Kucherenko, Christian Wolf

Abstract: For a continuous map $f$ on a compact metric space we study the geometry and entropy of the generalized rotation set $\R(Φ)$. Here $Φ=(φ_1,...,φ_m)$ is an $m$-dimensional continuous potential and $\R(Φ)$ is the set of all $μ$-integrals of $Φ$, where $μ$ runs over all $f$-invariant probability measures. It is easy to see that the rotation set is a compact and convex subset of $\mathbb{R}^m$. We study the question of whether every compact and convex set is attained as a rotation set of a particular set of potentials within a particular class of dynamical systems. We give a positive answer in the case of subshifts of finite type by constructing for every compact and convex set $K$ in $\mathbb{R}^m$ a potential $Φ=Φ(K)$ with $\R(Φ)=K$. Next, we study the relation between $\R(Φ)$ and the set of all statistical limits $\R_{Pt}(Φ)$. We show that in general these sets differ but also provide criteria that guarantee $\R(Φ)= \R_{Pt}(Φ)$. Finally, we study the entropy function $w\mapsto H(w), w\in \R(Φ)$. We establish a variational principle for the entropy function and show that for certain non-uniformly hyperbolic systems $H(w)$ is determined by the growth rate of those hyperbolic periodic orbits whose $Φ$-integrals are close to $w$. We also show that for systems with strong thermodynamic properties (subshifts of finite type, hyperbolic systems and expansive homeomorphisms with specification, etc.) the entropy function $w\mapsto H(w)$ is real-analytic in the interior of the rotation set.

Submitted 29 September, 2012; originally announced October 2012.

Showing 1–29 of 29 results for author: Kucherenko, T