-
Love, Joy, and Autism Robots: A Metareview and Provocatype
Authors:
Andrew Hundt,
Gabrielle Ohlson,
Pieter Wolfert,
Lux Miranda,
Sophia Zhu,
Katie Winkle
Abstract:
Previous work has observed how Neurodivergence is often harmfully pathologized in Human-Computer Interaction (HCI) and Human-Robot Interaction (HRI) research. We conduct a review of autism robot reviews and find the dominant research direction is Autistic people's second-to-lowest (24 of 25) research priority: interventions and treatments purporting to 'help' neurodivergent individuals to conform to neurotypical social norms, become better behaved, improve social and emotional skills, and otherwise 'fix' us -- rarely prioritizing the internal experiences that might lead to such differences. Furthermore, a growing body of evidence indicates many of the most popular current approaches risk inflicting lasting trauma and damage on Autistic people. We draw on the principles and findings of the latest Autism research, Feminist HRI, and Robotics to imagine a role reversal, analyze the implications, then conclude with actionable guidance on Autistic-led scientific methods and research directions.
Submitted 8 March, 2024;
originally announced March 2024.
-
Evaluating gesture generation in a large-scale open challenge: The GENEA Challenge 2022
Authors:
Taras Kucherenko,
Pieter Wolfert,
Youngwoo Yoon,
Carla Viegas,
Teodor Nikolov,
Mihail Tsakov,
Gustav Eje Henter
Abstract:
This paper reports on the second GENEA Challenge to benchmark data-driven automatic co-speech gesture generation. Participating teams used the same speech and motion dataset to build gesture-generation systems. Motion generated by all these systems was rendered to video using a standardised visualisation pipeline and evaluated in several large, crowdsourced user studies. Unlike when comparing different research papers, differences in results are here only due to differences between methods, enabling direct comparison between systems. The dataset was based on 18 hours of full-body motion capture, including fingers, of different persons engaging in a dyadic conversation. Ten teams participated in the challenge across two tiers: full-body and upper-body gesticulation. For each tier, we evaluated both the human-likeness of the gesture motion and its appropriateness for the specific speech signal. Our evaluations decouple human-likeness from gesture appropriateness, which has been a difficult problem in the field.
The evaluation results show some synthetic gesture conditions being rated as significantly more human-like than 3D human motion capture. To the best of our knowledge, this has not been demonstrated before. On the other hand, all synthetic motion is found to be vastly less appropriate for the speech than the original motion-capture recordings. We also find that conventional objective metrics do not correlate well with subjective human-likeness ratings in this large evaluation. The one exception is the Fréchet gesture distance (FGD), which achieves a Kendall's tau rank correlation of around $-0.5$. Based on the challenge results we formulate numerous recommendations for system building and evaluation.
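The rank correlation mentioned in the abstract above can be illustrated with a minimal sketch. It computes Kendall's tau (the tau-a variant, which does not correct for ties) between an objective metric and subjective ratings. The function name `kendall_tau` and all numbers are hypothetical and chosen for illustration only; they are not the challenge data, and the paper may have used a different tau variant or a library routine.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-system values (NOT from the challenge):
# a lower FGD is expected to go with a higher human-likeness rating.
fgd = [10.0, 20.0, 30.0, 40.0, 50.0]
human_likeness = [60.0, 65.0, 50.0, 45.0, 40.0]

tau = kendall_tau(fgd, human_likeness)
print(f"Kendall's tau = {tau:.2f}")
```

A negative tau is the expected direction here, because FGD is a distance (lower is better) while human-likeness is a rating (higher is better).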
Submitted 28 March, 2024; v1 submitted 15 March, 2023;
originally announced March 2023.
-
The GENEA Challenge 2022: A large evaluation of data-driven co-speech gesture generation
Authors:
Youngwoo Yoon,
Pieter Wolfert,
Taras Kucherenko,
Carla Viegas,
Teodor Nikolov,
Mihail Tsakov,
Gustav Eje Henter
Abstract:
This paper reports on the second GENEA Challenge to benchmark data-driven automatic co-speech gesture generation. Participating teams used the same speech and motion dataset to build gesture-generation systems. Motion generated by all these systems was rendered to video using a standardised visualisation pipeline and evaluated in several large, crowdsourced user studies. Unlike when comparing different research papers, differences in results are here only due to differences between methods, enabling direct comparison between systems. This year's dataset was based on 18 hours of full-body motion capture, including fingers, of different persons engaging in dyadic conversation. Ten teams participated in the challenge across two tiers: full-body and upper-body gesticulation. For each tier we evaluated both the human-likeness of the gesture motion and its appropriateness for the specific speech signal. Our evaluations decouple human-likeness from gesture appropriateness, which previously was a major challenge in the field.
The evaluation results are a revolution, and a revelation. Some synthetic conditions are rated as significantly more human-like than human motion capture. To the best of our knowledge, this has never been shown before on a high-fidelity avatar. On the other hand, all synthetic motion is found to be vastly less appropriate for the speech than the original motion-capture recordings. Additional material is available via the project website at https://youngwoo-yoon.github.io/GENEAchallenge2022/
Submitted 22 August, 2022;
originally announced August 2022.
-
To Rate or Not To Rate: Investigating Evaluation Methods for Generated Co-Speech Gestures
Authors:
Pieter Wolfert,
Jeffrey M. Girard,
Taras Kucherenko,
Tony Belpaeme
Abstract:
While automatic performance metrics are crucial for machine learning of artificial human-like behaviour, the gold standard for evaluation remains human judgement. The subjective evaluation of artificial human-like behaviour in embodied conversational agents is, however, expensive, and little is known about the quality of the data it returns. Two approaches to subjective evaluation can be largely distinguished, one relying on ratings, the other on pairwise comparisons. In this study we use co-speech gestures to compare the two against each other and answer questions about their appropriateness for evaluation of artificial behaviour. We consider their ability to rate quality, but also aspects pertaining to the effort of use and the time required to collect subjective data. We use crowdsourcing to rate the quality of co-speech gestures in avatars, assessing which method picks up more detail in subjective assessments. We compared gestures generated by three different machine learning models with varying levels of behavioural quality. We found that both approaches were able to rank the videos according to quality and that the rankings were significantly correlated, showing that in terms of quality there is no reason to prefer one method over the other. We also found that pairwise comparisons were slightly faster and came with improved inter-rater reliability, suggesting that for small-scale studies pairwise comparisons are to be favoured over ratings.
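As a rough illustration of the two evaluation approaches contrasted in the abstract above, the sketch below derives a quality ranking once from per-video ratings (by mean score) and once from pairwise comparisons (by win rate). The function names, condition labels, and all data are hypothetical, and the abstract does not state how the study itself aggregated responses, so this is only one plausible way to do it.

```python
from collections import defaultdict
from statistics import mean


def rank_from_ratings(ratings):
    """Rank conditions best-first by mean rating.

    ratings: dict mapping condition name -> list of numeric scores.
    """
    return sorted(ratings, key=lambda c: mean(ratings[c]), reverse=True)


def rank_from_pairwise(comparisons):
    """Rank conditions best-first by the fraction of comparisons won.

    comparisons: list of (winner, loser) tuples, one per judgement.
    """
    wins = defaultdict(int)
    total = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        total[winner] += 1
        total[loser] += 1
    return sorted(total, key=lambda c: wins[c] / total[c], reverse=True)


# Hypothetical responses for three gesture-generation models.
ratings = {
    "model_a": [4, 5, 4],
    "model_b": [3, 3, 4],
    "model_c": [2, 1, 2],
}
comparisons = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]

print(rank_from_ratings(ratings))
print(rank_from_pairwise(comparisons))
```

When both methods produce the same ordering, as in this toy data, the choice between them comes down to cost and reliability rather than to the ranking itself, which is the trade-off the abstract discusses.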
Submitted 13 August, 2021; v1 submitted 12 August, 2021;
originally announced August 2021.
-
A large, crowdsourced evaluation of gesture generation systems on common data: The GENEA Challenge 2020
Authors:
Taras Kucherenko,
Patrik Jonell,
Youngwoo Yoon,
Pieter Wolfert,
Gustav Eje Henter
Abstract:
Co-speech gestures, gestures that accompany speech, play an important role in human communication. Automatic co-speech gesture generation is thus a key enabling technology for embodied conversational agents (ECAs), since humans expect ECAs to be capable of multi-modal communication. Research into gesture generation is rapidly gravitating towards data-driven methods. Unfortunately, individual research efforts in the field are difficult to compare: there are no established benchmarks, and each study tends to use its own dataset, motion visualisation, and evaluation methodology. To address this situation, we launched the GENEA Challenge, a gesture-generation challenge wherein participating teams built automatic gesture-generation systems on a common dataset, and the resulting systems were evaluated in parallel in a large, crowdsourced user study using the same motion-rendering pipeline. Since differences in evaluation outcomes between systems now are solely attributable to differences between the motion-generation methods, this enables benchmarking recent approaches against one another in order to get a better impression of the state of the art in the field. This paper reports on the purpose, design, results, and implications of our challenge.
Submitted 23 February, 2021;
originally announced February 2021.
-
HEMVIP: Human Evaluation of Multiple Videos in Parallel
Authors:
Patrik Jonell,
Youngwoo Yoon,
Pieter Wolfert,
Taras Kucherenko,
Gustav Eje Henter
Abstract:
In many research areas, for example motion and gesture generation, objective measures alone do not provide an accurate impression of key stimulus traits such as perceived quality or appropriateness. The gold standard is instead to evaluate these aspects through user studies, especially subjective evaluations of video stimuli. Common evaluation paradigms either present individual stimuli to be scored on Likert-type scales, or ask users to compare and rate videos in a pairwise fashion. However, the time and resources required for such evaluations scale poorly as the number of conditions to be compared increases. Building on standards used for evaluating the quality of multimedia codecs, this paper instead introduces a framework for granular rating of multiple comparable videos in parallel. This methodology essentially analyses all condition pairs at once. Our contributions are 1) a proposed framework, called HEMVIP, for parallel and granular evaluation of multiple video stimuli and 2) a validation study confirming that results obtained using the tool are in close agreement with results of prior studies using conventional multiple pairwise comparisons.
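The remark that parallel rating "essentially analyses all condition pairs at once" can be sketched as follows: when one rater scores several videos on the same screen, every pair of those videos yields an implied preference. The function `implied_pairwise`, the condition names, and the scores are hypothetical and are not part of the HEMVIP tool; the 0-100 scale is likewise an assumption made for the example.

```python
from itertools import combinations


def implied_pairwise(screen_ratings):
    """Derive implied pairwise preferences from one rater's parallel ratings.

    screen_ratings: dict mapping condition name -> score given on one screen.
    Returns a list of (preferred, other) tuples; tied pairs are skipped.
    """
    preferences = []
    for a, b in combinations(sorted(screen_ratings), 2):
        if screen_ratings[a] > screen_ratings[b]:
            preferences.append((a, b))
        elif screen_ratings[b] > screen_ratings[a]:
            preferences.append((b, a))
    return preferences


# Hypothetical scores from one rater for four videos shown side by side.
screen = {"cond_a": 72, "cond_b": 40, "cond_c": 55, "cond_d": 40}

pairs = implied_pairwise(screen)
n = len(screen)
print(f"{len(pairs)} implied preferences from {n * (n - 1) // 2} possible pairs")
```

With N conditions on a screen, a single rating pass covers up to N(N-1)/2 pairs, which is why the parallel format scales better than collecting each pairwise comparison separately.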
Submitted 20 October, 2021; v1 submitted 28 January, 2021;
originally announced January 2021.
-
A Review of Evaluation Practices of Gesture Generation in Embodied Conversational Agents
Authors:
Pieter Wolfert,
Nicole Robinson,
Tony Belpaeme
Abstract:
Embodied conversational agents (ECAs) are often designed to produce nonverbal behavior to complement or enhance their verbal communication. One such form of nonverbal behavior is co-speech gesturing, which involves movements that the agent makes with its arms and hands that are paired with verbal communication. Co-speech gestures for ECAs can be created using different generation methods, divided into rule-based and data-driven processes, with the latter gaining traction because of the increasing interest from the applied machine learning community. However, reports on gesture generation methods use a variety of evaluation measures, which hinders comparison. To address this, we present a systematic review of co-speech gesture generation methods for iconic, metaphoric, deictic, and beat gestures, including reported evaluation methods. We review 22 studies that have an ECA with a human-like upper body that uses co-speech gesturing in social human-agent interaction. This includes studies that use human participants to evaluate performance. We found most studies use a within-subject design and rely on a form of subjective evaluation, but without a systematic approach. We argue that the field requires more rigorous and uniform tools for co-speech gesture evaluation, and formulate recommendations for empirical evaluation, including standardized phrases and example scenarios to help systematically test generative models across studies. Furthermore, we propose a checklist that can be used to report relevant information for the evaluation of generative models, as well as to evaluate co-speech gesture use.
Submitted 1 March, 2022; v1 submitted 11 January, 2021;
originally announced January 2021.