-
Exchanging... Watch out!
Authors:
Liu Yang,
Jieyeon Woo,
Catherine Achard,
Catherine Pelachaud
Abstract:
During a conversation, individuals take turns speaking and engage in exchanges, which can occur smoothly or involve interruptions. Listeners have various ways of participating, such as displaying backchannels, signalling the aim to take a turn, waiting for the speaker to yield the floor, or even interrupting and taking over the conversation.
These exchanges are commonplace in natural interactions. To create realistic and engaging interactions between human participants and embodied conversational agents (ECAs), it is crucial to equip virtual agents with the ability to manage these exchanges. This includes being able to initiate or respond to signals from the human user. In order to achieve this, we annotate, analyze and characterize these exchanges in human-human conversations. In this paper, we present an analysis of multimodal features, with a focus on prosodic features such as pitch (F0) and loudness, as well as facial expressions, to describe different types of exchanges.
Submitted 8 November, 2023;
originally announced November 2023.
-
AMII: Adaptive Multimodal Inter-personal and Intra-personal Model for Adapted Behavior Synthesis
Authors:
Jieyeon Woo,
Mireille Fares,
Catherine Pelachaud,
Catherine Achard
Abstract:
Socially Interactive Agents (SIAs) are physical or virtual embodied agents that display behavior similar to human multimodal behavior. Modeling SIAs' non-verbal behavior, such as speech and facial gestures, has always been a challenging task, given that a SIA can take the role of a speaker or a listener. A SIA must emit appropriate behavior adapted to its own speech, its previous behaviors (intra-personal), and the User's behaviors (inter-personal) for both roles. We propose AMII, a novel approach to synthesize adaptive facial gestures for SIAs while interacting with Users and acting interchangeably as a speaker or as a listener. AMII is characterized by a modality memory encoding schema, where modality corresponds to either speech or facial gestures, and makes use of attention mechanisms to capture the intra-personal and inter-personal relationships. We validate our approach by conducting objective evaluations and comparing it with state-of-the-art approaches.
Submitted 18 May, 2023;
originally announced May 2023.
-
RoCNet: 3D Robust Registration of Point-Clouds using Deep Learning
Authors:
Karim Slimani,
Brahim Tamadazte,
Catherine Achard
Abstract:
This paper introduces a new method for 3D point cloud registration based on deep learning. The architecture is composed of three distinct blocks: (i) an encoder composed of a convolutional graph-based descriptor that encodes the immediate neighbourhood of each point and an attention mechanism that encodes the variations of the surface normals, with the resulting descriptors refined by attention between the points of the same set and then between the points of the two sets; (ii) a matching process that estimates a matrix of correspondences using the Sinkhorn algorithm; and (iii) a final stage in which the rigid transformation between the two point clouds is calculated by RANSAC using the Kc best scores from the correspondence matrix. We conduct experiments on the ModelNet40 dataset, and our proposed architecture shows very promising results, outperforming state-of-the-art methods in most of the simulated configurations, including partial overlap and data augmentation with Gaussian noise.
Submitted 26 October, 2023; v1 submitted 14 March, 2023;
originally announced March 2023.
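The RoCNet abstract above mentions estimating a correspondence matrix with the Sinkhorn algorithm. As a rough illustration only, the following sketch shows plain Sinkhorn normalisation on a small square score matrix: the scores are exponentiated, then rows and columns are rescaled in turn until the matrix is approximately doubly stochastic. The function name `sinkhorn`, the iteration count, and the example scores are assumptions made for this sketch; the paper's actual matching layer may differ (for example, in how it handles unmatched points).

```python
import math


def sinkhorn(scores, n_iters=100):
    """Plain Sinkhorn normalisation (illustrative, not RoCNet's exact layer).

    scores: square list-of-lists of real-valued matching scores.
    Returns a matrix of positive weights whose rows and columns each
    sum to approximately 1.
    """
    # Exponentiate so that every entry is strictly positive.
    p = [[math.exp(s) for s in row] for row in scores]
    n_cols = len(p[0])
    for _ in range(n_iters):
        # Rescale each row so that it sums to 1.
        normalised_rows = []
        for row in p:
            row_sum = sum(row)
            normalised_rows.append([v / row_sum for v in row])
        p = normalised_rows
        # Rescale each column so that it sums to 1.
        col_sums = [sum(row[j] for row in p) for j in range(n_cols)]
        p = [[v / col_sums[j] for j, v in enumerate(row)] for row in p]
    return p


# Hypothetical 3x3 score matrix between two tiny point sets.
example_scores = [
    [1.0, 0.2, 0.1],
    [0.3, 2.0, 0.5],
    [0.0, 0.4, 1.5],
]
soft_assignment = sinkhorn(example_scores)
```

Each entry of `soft_assignment` can be read as a soft matching weight between point i of one set and point j of the other; a downstream step would then keep only the highest-scoring pairs.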
-
SALAD: Self-Assessment Learning for Action Detection
Authors:
Guillaume Vaudaux-Ruth,
Adrien Chan-Hon-Tong,
Catherine Achard
Abstract:
Literature on self-assessment in machine learning mainly focuses on the production of well-calibrated algorithms through consensus frameworks, i.e., calibration is seen as a problem. Yet, we observe that learning to be properly confident could behave like a powerful regularization and thus could be an opportunity to improve performance. Precisely, we show that, used within a framework of action detection, the learning of a self-assessment score is able to improve the whole action localization process. Experimental results show that our approach outperforms the state of the art on two action detection benchmarks. On the THUMOS14 dataset, the mAP at tIoU@0.5 is improved from 42.8% to 44.6%, and from 50.4% to 51.7% on the ActivityNet1.3 dataset. For lower tIoU values, we achieve even more significant improvements on both datasets.
Submitted 13 November, 2020;
originally announced November 2020.
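The SALAD abstract above reports mAP at given tIoU values. For readers unfamiliar with the measure, the sketch below computes temporal Intersection over Union between a predicted and a ground-truth action segment, each given as a (start, end) pair in seconds. The function name `temporal_iou` and the example segments are assumptions for illustration; the benchmarks' official evaluation code is not reproduced here.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two segments given as (start, end) pairs."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Length of the overlap; zero when the segments are disjoint.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    # Union = sum of both lengths minus the overlap counted twice.
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


# A hypothetical detection overlapping half of a 10-second action.
overlap = temporal_iou((0.0, 10.0), (5.0, 15.0))
```

A detection is typically counted as correct at a given threshold when its tIoU with a ground-truth segment of the same class reaches that threshold, so lower thresholds tolerate looser temporal boundaries.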
-
ActionSpotter: Deep Reinforcement Learning Framework for Temporal Action Spotting in Videos
Authors:
Guillaume Vaudaux-Ruth,
Adrien Chan-Hon-Tong,
Catherine Achard
Abstract:
Summarizing video content is an important task in many applications. This task can be defined as the computation of the ordered list of actions present in a video. Such a list could be extracted using action detection algorithms. However, it is not necessary to determine the temporal boundaries of actions to know their existence. Moreover, localizing precise boundaries usually requires dense video analysis to be effective. In this work, we propose to directly compute this ordered list by sparsely browsing the video and selecting one frame per action instance, a task known as action spotting in the literature. To do this, we propose ActionSpotter, a spotting algorithm that takes advantage of Deep Reinforcement Learning to efficiently spot actions while adapting its video browsing speed, without additional supervision. Experiments performed on the THUMOS14 and ActivityNet datasets show that our framework outperforms state-of-the-art detection methods. In particular, the spotting mean Average Precision on THUMOS14 is significantly improved from 59.7% to 65.6% while skipping 23% of the video.
Submitted 10 November, 2020; v1 submitted 15 April, 2020;
originally announced April 2020.
-
Explaining Regression Based Neural Network Model
Authors:
Mégane Millan,
Catherine Achard
Abstract:
Several methods have been proposed to explain Deep Neural Networks (DNNs). However, to our knowledge, only classification networks have been studied to try to determine which input dimensions motivated the decision. Furthermore, as there is no ground truth to this problem, results are only assessed qualitatively with regard to what would be meaningful for a human. In this work, we design an experimental setting where the ground truth can be established: we generate ideal signals and disrupted signals with errors and learn a neural network that determines the quality of the signals. This quality is simply a score based on the distance between the disrupted signals and the corresponding ideal signal. We then try to find out how the network estimated this score and hope to find the time-steps and dimensions of the signal where errors are present. This experimental setting enables us to compare several methods for network explanation and to propose a new method, named AGRA for Accurate Gradient, based on several trainings that decrease the noise present in most state-of-the-art results. Comparative results show that the proposed method outperforms state-of-the-art methods for locating time-steps where errors occur in the signal.
Submitted 15 April, 2020;
originally announced April 2020.
-
Single-shot 3D multi-person pose estimation in complex images
Authors:
Abdallah Benzine,
Bertrand Luvison,
Quoc Cuong Pham,
Catherine Achard
Abstract:
In this paper, we propose a new single-shot method for multi-person 3D human pose estimation in complex images. The model jointly learns to locate the human joints in the image, to estimate their 3D coordinates and to group these predictions into full human skeletons. The proposed method deals with a variable number of people and does not need bounding boxes to estimate the 3D poses. It leverages and extends the Stacked Hourglass Network and its multi-scale feature learning to manage multi-person situations. Thus, we exploit a robust 3D human pose formulation to fully describe several 3D human poses even in case of strong occlusions or crops. Then, joint grouping and human pose estimation for an arbitrary number of people are performed using the associative embedding method. Our approach significantly outperforms the state of the art on the challenging CMU Panoptic dataset and a previous single-shot method on the MuPoTS-3D dataset. Furthermore, it leads to good results on the complex and synthetic images from the newly proposed JTA Dataset.
Submitted 7 January, 2021; v1 submitted 8 November, 2019;
originally announced November 2019.