Search | arXiv e-print repository

T2LM: Long-Term 3D Human Motion Generation from Multiple Sentences

Authors: Taeryung Lee, Fabien Baradel, Thomas Lucas, Kyoung Mu Lee, Gregory Rogez

Abstract: In this paper, we address the challenging problem of long-term 3D human motion generation. Specifically, we aim to generate a long sequence of smoothly connected actions from a stream of multiple sentences (i.e., a paragraph). Previous long-term motion generation approaches were mostly based on recurrent methods, using previously generated motion chunks as input for the next step. However, this approach has two drawbacks: 1) it relies on sequential datasets, which are expensive; 2) it yields unrealistic gaps between the motions generated at each step. To address these issues, we introduce T2LM, a simple yet effective continuous long-term generation framework that can be trained without sequential data. T2LM comprises two components: a 1D-convolutional VQ-VAE, trained to compress motion to sequences of latent vectors, and a Transformer-based Text Encoder that predicts a latent sequence given an input text. At inference, a sequence of sentences is translated into a continuous stream of latent vectors. This is then decoded into a motion by the VQ-VAE decoder; the use of 1D convolutions with a local temporal receptive field avoids temporal inconsistencies between training and generated sequences. This simple constraint on the VQ-VAE allows it to be trained with short sequences only and produces smoother transitions. T2LM outperforms prior long-term generation models while overcoming the constraint of requiring sequential data; it is also competitive with SOTA single-action generation models.

Submitted 2 June, 2024; originally announced June 2024.

Comments: CVPR 2024 HuMoGen Workshop

arXiv:2404.12942

Purposer: Putting Human Motion Generation in Context

Authors: Nicolas Ugrinovic, Thomas Lucas, Fabien Baradel, Philippe Weinzaepfel, Gregory Rogez, Francesc Moreno-Noguer

Abstract: We present a novel method to generate human motion to populate 3D indoor scenes. It can be controlled with various combinations of conditioning signals such as a path in a scene, target poses, past motions, and scenes represented as 3D point clouds. State-of-the-art methods are either models specialized to one single setting, require vast amounts of high-quality and diverse training data, or are unconditional models that do not integrate scene or other contextual information. As a consequence, they have limited applicability and rely on costly training data. To address these limitations, we propose a new method, dubbed Purposer, based on neural discrete representation learning. Our model is capable of exploiting, in a flexible manner, different types of information already present in open-access large-scale datasets such as AMASS. First, we encode unconditional human motion into a discrete latent space. Second, an autoregressive generative model, conditioned with key contextual information, either with prompting or additive tokens, and trained for next-step prediction in this space, synthesizes sequences of latent indices. We further design a novel conditioning block to handle future conditioning information in such a causal model by using a network with two branches to compute separate stacks of features. In this manner, Purposer can generate realistic motion sequences in diverse test scenes. Through exhaustive evaluation, we demonstrate that our multi-contextual solution outperforms existing specialized approaches for specific contextual information, both in terms of quality and diversity. Our model is trained with short sequences, but a byproduct of being able to use various conditioning signals is that at test time different combinations can be used to chain short sequences together and generate long motions within a context scene.

Submitted 19 April, 2024; originally announced April 2024.

arXiv:2402.16392

Placing Objects in Context via Inpainting for Out-of-distribution Segmentation

Authors: Pau de Jorge, Riccardo Volpi, Puneet K. Dokania, Philip H. S. Torr, Gregory Rogez

Abstract: When deploying a semantic segmentation model into the real world, it will inevitably be confronted with semantic classes unseen during training. Thus, to safely deploy such systems, it is crucial to accurately evaluate and improve their anomaly segmentation capabilities. However, acquiring and labelling semantic segmentation data is expensive and unanticipated conditions are long-tail and potentially hazardous. Indeed, existing anomaly segmentation datasets capture a limited number of anomalies, lack realism or have strong domain shifts. In this paper, we propose the Placing Objects in Context (POC) pipeline to realistically add any object into any image via diffusion models. POC can be used to easily extend any dataset with an arbitrary number of objects. In our experiments, we present different anomaly segmentation datasets based on POC-generated data and show that POC can improve the performance of recent state-of-the-art anomaly fine-tuning methods in several standardized benchmarks. POC is also effective for learning new classes. For example, we use it to edit Cityscapes samples by adding a subset of Pascal classes and show that models trained on such data achieve comparable performance to the Pascal-trained baseline. This corroborates the low sim-to-real gap of models trained on POC-generated images.

Submitted 26 February, 2024; originally announced February 2024.

arXiv:2402.14654

Multi-HMR: Multi-Person Whole-Body Human Mesh Recovery in a Single Shot

Authors: Fabien Baradel, Matthieu Armando, Salma Galaaoui, Romain Brégier, Philippe Weinzaepfel, Grégory Rogez, Thomas Lucas

Abstract: We present Multi-HMR, a strong single-shot model for multi-person 3D human mesh recovery from a single RGB image. Predictions encompass the whole body, i.e., including hands and facial expressions, using the SMPL-X parametric model and spatial location in the camera coordinate system. Our model detects people by predicting coarse 2D heatmaps of person centers, using features produced by a standard Vision Transformer (ViT) backbone. It then predicts their whole-body pose, shape and spatial location using a new cross-attention module called the Human Prediction Head (HPH), with one query per detected center token, attending to the entire set of features. As direct prediction of SMPL-X parameters yields suboptimal results, we introduce CUFFS, the Close-Up Frames of Full-Body Subjects dataset, containing humans close to the camera with diverse hand poses. We show that incorporating this dataset into training further enhances predictions, particularly for hands, enabling us to achieve state-of-the-art performance. Multi-HMR also optionally accounts for camera intrinsics, if available, by encoding camera ray directions for each image token. This simple design achieves strong performance on whole-body and body-only benchmarks simultaneously. We train models with various backbone sizes and input resolutions. In particular, using a ViT-S backbone and $448\times448$ input images already yields a fast and competitive model with respect to state-of-the-art methods, while larger models and higher resolutions further improve performance.

Submitted 22 February, 2024; originally announced February 2024.

Comments: https://github.com/naver/multi-hmr

arXiv:2311.09104

Cross-view and Cross-pose Completion for 3D Human Understanding

Authors: Matthieu Armando, Salma Galaaoui, Fabien Baradel, Thomas Lucas, Vincent Leroy, Romain Brégier, Philippe Weinzaepfel, Grégory Rogez

Abstract: Human perception and understanding is a major domain of computer vision which, like many other vision subdomains recently, stands to gain from the use of large models pre-trained on large datasets. We hypothesize that the most common pre-training strategy of relying on general purpose, object-centric image datasets such as ImageNet, is limited by an important domain shift. On the other hand, collecting domain-specific ground truth such as 2D or 3D labels does not scale well. Therefore, we propose a pre-training approach based on self-supervised learning that works on human-centric data using only images. Our method uses pairs of images of humans: the first is partially masked and the model is trained to reconstruct the masked parts given the visible ones and a second image. It relies on both stereoscopic (cross-view) pairs, and temporal (cross-pose) pairs taken from videos, in order to learn priors about 3D as well as human motion. We pre-train a model for body-centric tasks and one for hand-centric tasks. With a generic transformer architecture, these models outperform existing self-supervised pre-training methods on a wide set of human-centric downstream tasks, and obtain state-of-the-art performance for instance when fine-tuning for model-based and model-free human mesh recovery.

Submitted 18 April, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

Comments: CVPR 2024

arXiv:2309.10748

SHOWMe: Benchmarking Object-agnostic Hand-Object 3D Reconstruction

Authors: Anilkumar Swamy, Vincent Leroy, Philippe Weinzaepfel, Fabien Baradel, Salma Galaaoui, Romain Bregier, Matthieu Armando, Jean-Sebastien Franco, Gregory Rogez

Abstract: Recent hand-object interaction datasets show limited real object variability and rely on fitting the MANO parametric model to obtain ground-truth hand shapes. To go beyond these limitations and spur further research, we introduce the SHOWMe dataset, which consists of 96 videos, annotated with real and detailed hand-object 3D textured meshes. Following recent work, we consider a rigid hand-object scenario, in which the pose of the hand with respect to the object remains constant during the whole video sequence. This assumption allows us to register sub-millimetre-precise ground-truth 3D scans to the image sequences in SHOWMe. Although simpler, this hypothesis makes sense in terms of applications where the required accuracy and level of detail are important, e.g., object hand-over in human-robot collaboration, object scanning, or manipulation and contact point analysis. Importantly, the rigidity of the hand-object systems allows us to tackle video-based 3D reconstruction of unknown hand-held objects using a 2-stage pipeline consisting of a rigid registration step followed by a multi-view reconstruction (MVR) part. We carefully evaluate a set of non-trivial baselines for these two stages and show that it is possible to achieve promising object-agnostic 3D hand-object reconstructions employing an SfM toolbox or a hand pose estimator to recover the rigid transforms and off-the-shelf MVR algorithms. However, these methods remain sensitive to the initial camera pose estimates, which might be imprecise due to lack of textures on the objects or heavy occlusions of the hands, leaving room for improvements in the reconstruction. Code and dataset are available at https://europe.naverlabs.com/research/showme

Submitted 19 September, 2023; originally announced September 2023.

Comments: Paper and Appendix, Accepted in ACVR workshop at ICCV conference

arXiv:2309.08480

PoseFix: Correcting 3D Human Poses with Natural Language

Authors: Ginger Delmas, Philippe Weinzaepfel, Francesc Moreno-Noguer, Grégory Rogez

Abstract: Automatically producing instructions to modify one's posture could open the door to endless applications, such as personalized coaching and in-home physical therapy. Tackling the reverse problem (i.e., refining a 3D pose based on some natural language feedback) could help for assisted 3D character animation or robot teaching, for instance. Although a few recent works explore the connections between natural language and 3D human pose, none focus on describing 3D body pose differences. In this paper, we tackle the problem of correcting 3D human poses with natural language. To this end, we introduce the PoseFix dataset, which consists of several thousand paired 3D poses and their corresponding text feedback, which describes how the source pose needs to be modified to obtain the target pose. We demonstrate the potential of this dataset on two tasks: (1) text-based pose editing, which aims at generating corrected 3D body poses given a query pose and a text modifier; and (2) correctional text generation, where instructions are generated based on the differences between two body poses.

Submitted 17 January, 2024; v1 submitted 15 September, 2023; originally announced September 2023.

Comments: Published in ICCV 2023

arXiv:2306.07399

4DHumanOutfit: a multi-subject 4D dataset of human motion sequences in varying outfits exhibiting large displacements

Authors: Matthieu Armando, Laurence Boissieux, Edmond Boyer, Jean-Sebastien Franco, Martin Humenberger, Christophe Legras, Vincent Leroy, Mathieu Marsot, Julien Pansiot, Sergi Pujades, Rim Rekik, Gregory Rogez, Anilkumar Swamy, Stefanie Wuhrer

Abstract: This work presents 4DHumanOutfit, a new dataset of densely sampled spatio-temporal 4D human motion data of different actors, outfits and motions. The dataset is designed to contain different actors wearing different outfits while performing different motions in each outfit. In this way, the dataset can be seen as a cube of data containing 4D motion sequences along 3 axes: identity, outfit and motion. This rich dataset has numerous potential applications for the processing and creation of digital humans, e.g. augmented reality, avatar creation and virtual try-on. 4DHumanOutfit is released for research purposes at https://kinovis.inria.fr/4dhumanoutfit/. In addition to image data and 4D reconstructions, the dataset includes reference solutions for each axis. We present independent baselines along each axis that demonstrate the value of these reference solutions for evaluation tasks.

Submitted 12 June, 2023; originally announced June 2023.

arXiv:2303.11298

Reliability in Semantic Segmentation: Are We on the Right Track?

Authors: Pau de Jorge, Riccardo Volpi, Philip Torr, Gregory Rogez

Abstract: Motivated by the increasing popularity of transformers in computer vision, in recent times there has been a rapid development of novel architectures. While in-domain performance follows a constant, upward trend, properties like robustness or uncertainty estimation are less explored, leaving doubts about advances in model reliability. Studies along these axes exist, but they are mainly limited to classification models. In contrast, we carry out a study on semantic segmentation, a relevant task for many real-world applications where model reliability is paramount. We analyze a broad variety of models, spanning from older ResNet-based architectures to novel transformers, and assess their reliability based on four metrics: robustness, calibration, misclassification detection and out-of-distribution (OOD) detection. We find that while recent models are significantly more robust, they are not overall more reliable in terms of uncertainty estimation. We further explore methods that can come to the rescue and show that improving calibration can also help with other uncertainty metrics such as misclassification or OOD detection. This is the first study on modern segmentation models focused on both robustness and uncertainty estimation and we hope it will help practitioners and researchers interested in this fundamental vision task. Code available at https://github.com/naver/relis.

Submitted 20 March, 2023; originally announced March 2023.

Comments: Accepted to CVPR 2023
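
Calibration, one of the four metrics named in the abstract above, is commonly summarized by the expected calibration error (ECE). The abstract does not say which calibration measure the authors use, so the sketch below is only an illustration of the standard binned ECE; the function name and the choice of 10 equal-width bins are assumptions.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean gap between accuracy and confidence per bin.

    confidences: predicted max-class probabilities, assumed to lie in (0, 1].
    correct: 1 where the prediction was right, 0 otherwise.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Half-open bins (lo, hi]; a confidence of exactly 0 would be skipped.
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            # Weight each bin by the fraction of samples it holds.
            ece += in_bin.mean() * gap
    return ece
```

For segmentation, the inputs would typically be per-pixel confidences and per-pixel correctness flattened into one array; a model whose confidence matches its accuracy in every bin scores 0.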

arXiv:2210.11795

PoseScript: Linking 3D Human Poses and Natural Language

Authors: Ginger Delmas, Philippe Weinzaepfel, Thomas Lucas, Francesc Moreno-Noguer, Grégory Rogez

Abstract: Natural language plays a critical role in many computer vision applications, such as image captioning, visual question answering, and cross-modal retrieval, to provide fine-grained semantic information. Unfortunately, while human pose is key to human understanding, current 3D human pose datasets lack detailed language descriptions. To address this issue, we have introduced the PoseScript dataset. This dataset pairs more than six thousand 3D human poses from AMASS with rich human-annotated descriptions of the body parts and their spatial relationships. Additionally, to increase the size of the dataset to a scale that is compatible with data-hungry learning algorithms, we have proposed an elaborate captioning process that generates automatic synthetic descriptions in natural language from given 3D keypoints. This process extracts low-level pose information, known as "posecodes", using a set of simple but generic rules on the 3D keypoints. These posecodes are then combined into higher level textual descriptions using syntactic rules. With automatic annotations, the amount of available data significantly scales up (100k), making it possible to effectively pretrain deep models for finetuning on human captions. To showcase the potential of annotated poses, we present three multi-modal learning tasks that utilize the PoseScript dataset. Firstly, we develop a pipeline that maps 3D poses and textual descriptions into a joint embedding space, allowing for cross-modal retrieval of relevant poses from large-scale datasets. Secondly, we establish a baseline for a text-conditioned model generating 3D poses. Thirdly, we present a learned process for generating pose descriptions. These applications demonstrate the versatility and usefulness of annotated poses in various tasks and pave the way for future research in the field.

Submitted 19 January, 2024; v1 submitted 21 October, 2022; originally announced October 2022.

Comments: Extended version of the ECCV 2022 paper

arXiv:2210.10542

PoseGPT: Quantization-based 3D Human Motion Generation and Forecasting

Authors: Thomas Lucas, Fabien Baradel, Philippe Weinzaepfel, Grégory Rogez

Abstract: We address the problem of action-conditioned generation of human motion sequences. Existing work falls into two categories: forecast models conditioned on observed past motions, or generative models conditioned on action labels and duration only. In contrast, we generate motion conditioned on observations of arbitrary length, including none. To solve this generalized problem, we propose PoseGPT, an auto-regressive transformer-based approach which internally compresses human motion into quantized latent sequences. An auto-encoder first maps human motion to latent index sequences in a discrete space, and vice-versa. Inspired by the Generative Pretrained Transformer (GPT), we propose to train a GPT-like model for next-index prediction in that space; this allows PoseGPT to output distributions on possible futures, with or without conditioning on past motion. The discrete and compressed nature of the latent space allows the GPT-like model to focus on long-range signal, as it removes low-level redundancy in the input signal. Predicting discrete indices also alleviates the common pitfall of predicting averaged poses, a typical failure case when regressing continuous values, as the average of discrete targets is not a target itself. Our experimental results show that our proposed approach achieves state-of-the-art results on HumanAct12, a standard but small scale dataset, as well as on BABEL, a recent large scale MoCap dataset, and on GRAB, a human-object interactions dataset.

Submitted 19 October, 2022; originally announced October 2022.

Comments: ECCV'22 Conference paper
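
The quantization step described in the PoseGPT abstract above (mapping motion latents to index sequences in a discrete space) can be illustrated with a nearest-neighbour codebook lookup, the usual mechanism in VQ-VAE-style auto-encoders. This is a minimal sketch under assumptions, not the authors' implementation: the function name, the array shapes, and the use of squared Euclidean distance are all illustrative choices.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry.

    latents: array of shape (T, D), one latent per time step.
    codebook: array of shape (K, D), the K discrete codes.
    Returns (indices of shape (T,), quantized latents of shape (T, D)).
    """
    # Squared Euclidean distance from every latent to every code: shape (T, K).
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]
```

A model trained for next-index prediction would then operate on `indices` rather than on the continuous latents, and a decoder would map `codebook[indices]` back to motion.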

arXiv:2210.00627

MonoNHR: Monocular Neural Human Renderer

Authors: Hongsuk Choi, Gyeongsik Moon, Matthieu Armando, Vincent Leroy, Kyoung Mu Lee, Gregory Rogez

Abstract: Existing neural human rendering methods struggle with a single image input due to the lack of information in invisible areas and the depth ambiguity of pixels in visible areas. In this regard, we propose Monocular Neural Human Renderer (MonoNHR), a novel approach that renders robust free-viewpoint images of an arbitrary human given only a single image. MonoNHR is the first method that (i) renders human subjects never seen during training in a monocular setup, and (ii) is trained in a weakly-supervised manner without geometry supervision. First, we propose to disentangle 3D geometry and texture features and to condition the texture inference on the 3D geometry features. Second, we introduce a Mesh Inpainter module that inpaints the occluded parts exploiting human structural priors such as symmetry. Experiments on ZJU-MoCap, AIST, and HUMBI datasets show that our approach significantly outperforms the recent methods adapted to the monocular case.

Submitted 2 October, 2022; originally announced October 2022.

Comments: Hongsuk Choi and Gyeongsik Moon contributed equally, 15 pages including the reference and supplementary material

arXiv:2208.10211

PoseBERT: A Generic Transformer Module for Temporal 3D Human Modeling

Authors: Fabien Baradel, Romain Brégier, Thibault Groueix, Philippe Weinzaepfel, Yannis Kalantidis, Grégory Rogez

Abstract: Training state-of-the-art models for human pose estimation in videos requires datasets with annotations that are really hard and expensive to obtain. Although transformers have been recently utilized for body pose sequence modeling, related methods rely on pseudo-ground truth to augment the currently limited training data available for learning such models. In this paper, we introduce PoseBERT, a transformer module that is fully trained on 3D Motion Capture (MoCap) data via masked modeling. It is simple, generic and versatile, as it can be plugged on top of any image-based model to transform it into a video-based model leveraging temporal information. We showcase variants of PoseBERT with different inputs varying from 3D skeleton keypoints to rotations of a 3D parametric model for either the full body (SMPL) or just the hands (MANO). Since PoseBERT training is task agnostic, the model can be applied to several tasks such as pose refinement, future pose prediction or motion completion without finetuning. Our experimental results validate that adding PoseBERT on top of various state-of-the-art pose estimation methods consistently improves their performance, while its low computational cost allows us to use it in a real-time demo for smoothly animating a robotic hand via a webcam. Test code and models are available at https://github.com/naver/posebert.

Submitted 19 October, 2022; v1 submitted 22 August, 2022; originally announced August 2022.

Comments: Accepted to TPAMI 2022

arXiv:2206.08242

Catastrophic overfitting can be induced with discriminative non-robust features

Authors: Guillermo Ortiz-Jiménez, Pau de Jorge, Amartya Sanyal, Adel Bibi, Puneet K. Dokania, Pascal Frossard, Gregory Rogéz, Philip H. S. Torr

Abstract: Adversarial training (AT) is the de facto method for building robust neural networks, but it can be computationally expensive. To mitigate this, fast single-step attacks can be used, but this may lead to catastrophic overfitting (CO). This phenomenon appears when networks gain non-trivial robustness during the first stages of AT, but then reach a breaking point where they become vulnerable in just a few iterations. The mechanisms that lead to this failure mode are still poorly understood. In this work, we study the onset of CO in single-step AT methods through controlled modifications of typical datasets of natural images. In particular, we show that CO can be induced at much smaller $ε$ values than it was observed before just by injecting images with seemingly innocuous features. These features aid non-robust classification but are not enough to achieve robustness on their own. Through extensive experiments we analyze this novel phenomenon and discover that the presence of these easy features induces a learning shortcut that leads to CO. Our findings provide new insights into the mechanisms of CO and improve our understanding of the dynamics of AT. The code to reproduce our experiments can be found at https://github.com/gortizji/co_features.

Submitted 15 August, 2023; v1 submitted 16 June, 2022; originally announced June 2022.

Comments: Published in Transactions on Machine Learning Research (TMLR)

arXiv:2202.01181

Make Some Noise: Reliable and Efficient Single-Step Adversarial Training

Authors: Pau de Jorge, Adel Bibi, Riccardo Volpi, Amartya Sanyal, Philip H. S. Torr, Grégory Rogez, Puneet K. Dokania

Abstract: Recently, Wong et al. showed that adversarial training with single-step FGSM leads to a characteristic failure mode named Catastrophic Overfitting (CO), in which a model becomes suddenly vulnerable to multi-step attacks. Experimentally they showed that simply adding a random perturbation prior to FGSM (RS-FGSM) could prevent CO. However, Andriushchenko and Flammarion observed that RS-FGSM still leads to CO for larger perturbations, and proposed a computationally expensive regularizer (GradAlign) to avoid it. In this work, we methodically revisit the role of noise and clipping in single-step adversarial training. Contrary to previous intuitions, we find that using a stronger noise around the clean sample combined with not clipping is highly effective in avoiding CO for large perturbation radii. We then propose Noise-FGSM (N-FGSM) that, while providing the benefits of single-step adversarial training, does not suffer from CO. Empirical analyses on a large suite of experiments show that N-FGSM is able to match or surpass the performance of previous state-of-the-art GradAlign, while achieving 3x speed-up. Code can be found at https://github.com/pdejorge/N-FGSM

Submitted 17 October, 2022; v1 submitted 2 February, 2022; originally announced February 2022.

Comments: Published in NeurIPS 2022
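
As a rough illustration of the idea summarized in the N-FGSM abstract above (noise drawn around the clean sample, one signed-gradient step, no clipping of the result), here is a minimal sketch. It is not the authors' implementation: the toy logistic model, the function names, and the default values of `k` and `alpha` are assumptions made for this example. The reference code is at the repository linked in the abstract.

```python
import numpy as np


def loss_grad_wrt_input(w, x, y):
    """Gradient of the logistic loss log(1 + exp(-y * w.x)) with respect to x."""
    margin = y * np.dot(w, x)
    return -y * w / (1.0 + np.exp(margin))


def noisy_single_step_perturbation(w, x, y, eps, k=2.0, alpha=None, rng=None):
    """Single-step perturbation with noise around the clean sample and no clipping.

    k and alpha are illustrative hyperparameters (assumed, not from the abstract).
    """
    rng = np.random.default_rng() if rng is None else rng
    alpha = eps if alpha is None else alpha
    # Noise is drawn around the clean sample from a box wider than eps when k > 1.
    eta = rng.uniform(-k * eps, k * eps, size=x.shape)
    # One signed-gradient step, with the gradient evaluated at the noisy point.
    grad = loss_grad_wrt_input(w, x + eta, y)
    delta = eta + alpha * np.sign(grad)
    # Deliberately no projection of delta back onto the eps-ball.
    return delta
```

Because nothing is clipped, each coordinate of the returned perturbation is bounded by `k * eps + alpha` rather than by `eps`.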

arXiv:2112.12004 [pdf, other]

Barely-Supervised Learning: Semi-Supervised Learning with very few labeled images

Authors: Thomas Lucas, Philippe Weinzaepfel, Gregory Rogez

Abstract: This paper tackles the problem of semi-supervised learning when the set of labeled samples is limited to a small number of images per class, typically fewer than 10, a problem that we refer to as barely-supervised learning. We analyze in depth the behavior of a state-of-the-art semi-supervised method, FixMatch, which relies on a weakly-augmented version of an image to obtain supervision signal for a more strongly-augmented version. We show that it frequently fails in barely-supervised scenarios, due to a lack of training signal when no pseudo-label can be predicted with high confidence. We propose a method to leverage self-supervised methods that provides training signal in the absence of confident pseudo-labels. We then propose two methods to refine the pseudo-label selection process which lead to further improvements. The first one relies on a per-sample history of the model predictions, akin to a voting scheme. The second iteratively updates class-dependent confidence thresholds to better explore classes that are under-represented in the pseudo-labels. Our experiments show that our approach performs significantly better on STL-10 in the barely-supervised regime, e.g. with 4 or 8 labeled images per class.

Submitted 22 December, 2021; originally announced December 2021.
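
The failure mode described in the abstract above, where no pseudo-label clears the confidence bar and the unlabeled loss carries no signal, can be illustrated with a minimal sketch of confidence-thresholded pseudo-labeling. This is an illustration only and not the paper's method: the function names and the `threshold=0.95` default are assumptions for this example, and probabilities are passed in directly instead of coming from a network.

```python
import numpy as np


def confident_pseudo_labels(weak_probs, threshold=0.95):
    """Pseudo-labels from weak-augmentation predictions, plus a confidence mask."""
    weak_probs = np.asarray(weak_probs, dtype=float)
    labels = weak_probs.argmax(axis=1)
    # Only predictions at or above the threshold are allowed to supervise.
    mask = weak_probs.max(axis=1) >= threshold
    return labels, mask


def masked_cross_entropy(strong_probs, labels, mask):
    """Cross-entropy on strong-augmentation predictions, zeroed where mask is False."""
    strong_probs = np.asarray(strong_probs, dtype=float)
    picked = strong_probs[np.arange(len(labels)), labels]
    per_sample = -np.log(picked)
    # Averaged over the whole batch, so unconfident samples contribute zero.
    return float((per_sample * mask).sum() / len(labels))
```

When every row of `weak_probs` falls below the threshold, the mask is all `False` and the returned loss is exactly zero, which is the absence of training signal the abstract refers to.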

arXiv:2110.09243 [pdf, other]

Leveraging MoCap Data for Human Mesh Recovery

Authors: Fabien Baradel, Thibault Groueix, Philippe Weinzaepfel, Romain Brégier, Yannis Kalantidis, Grégory Rogez

Abstract: Training state-of-the-art models for human body pose and shape recovery from images or videos requires datasets with corresponding annotations that are really hard and expensive to obtain. Our goal in this paper is to study whether poses from 3D Motion Capture (MoCap) data can be used to improve image-based and video-based human mesh recovery methods. We find that fine-tuning image-based models with synthetic renderings from MoCap data can increase their performance, by providing them with a wider variety of poses, textures and backgrounds. In fact, we show that simply fine-tuning the batch normalization layers of the model is enough to achieve large gains. We further study the use of MoCap data for video, and introduce PoseBERT, a transformer module that directly regresses the pose parameters and is trained via masked modeling. It is simple, generic and can be plugged on top of any state-of-the-art image-based model in order to transform it into a video-based model leveraging temporal information. Our experimental results show that the proposed approaches reach state-of-the-art performance on various datasets including 3DPW, MPI-INF-3DHP, MuPoTS-3D, MCB and AIST. Test code and models will be available soon.

Submitted 18 October, 2021; originally announced October 2021.

Comments: 3DV 2021

arXiv:2103.09331 [pdf]

doi 10.1021/acs.nanolett.1c00973

Large Enhancement of Ferro-Magnetism under Collective Strong Coupling of YBCO Nanoparticles

Authors: Anoop Thomas, Eloise Devaux, Kalaivanan Nagarajan, Guillaume Rogez, Marcus Seidel, Fanny Richard, Cyriaque Genet, Marc Drillon, Thomas W. Ebbesen

Abstract: Light-matter strong coupling in the vacuum limit has been shown to enhance material properties over the past decade. Oxide nanoparticles are known to exhibit weak ferromagnetism due to vacancies in the lattice. Here we report the 700-fold enhancement of the ferromagnetism of YBa$_2$Cu$_3$O$_{7-x}$ nanoparticles under cooperative strong coupling at room temperature. The magnetic moment reaches 0.90 $μ_{\rm B}$/mol, and with such a high value, it competes with YBa$_2$Cu$_3$O$_{7-x}$ superconductivity at low temperature. This strong ferromagnetism at room temperature suggests that strong coupling is a new tool for the development of next generations of magnetic and spintronic nanodevices.

Submitted 23 March, 2021; v1 submitted 16 March, 2021; originally announced March 2021.

Comments: 24 pages, 4 figures - difference with v1 version: revised Supplementary Information file

arXiv:2012.09696 [pdf, other]

Multi-FinGAN: Generative Coarse-To-Fine Sampling of Multi-Finger Grasps

Authors: Jens Lundell, Enric Corona, Tran Nguyen Le, Francesco Verdoja, Philippe Weinzaepfel, Gregory Rogez, Francesc Moreno-Noguer, Ville Kyrki

Abstract: While there exist many methods for manipulating rigid objects with parallel-jaw grippers, grasping with multi-finger robotic hands remains a quite unexplored research topic. Reasoning and planning collision-free trajectories on the additional degrees of freedom of several fingers represents an important challenge that, so far, involves computationally costly and slow processes. In this work, we present Multi-FinGAN, a fast generative multi-finger grasp sampling method that synthesizes high quality grasps directly from RGB-D images in about a second. We achieve this by training in an end-to-end fashion a coarse-to-fine model composed of a classification network that distinguishes grasp types according to a specific taxonomy and a refinement network that produces refined grasp poses and joint angles. We experimentally validate and benchmark our method against a standard grasp-sampling method on 790 grasps in simulation and 20 grasps on a real Franka Emika Panda. All experimental results using our method show consistent improvements both in terms of grasp quality metrics and grasp success rate. Remarkably, our approach is up to 20-30 times faster than the baseline, a significant improvement that opens the door to feedback-based grasp re-planning and task informative grasping. Code is available at https://irobotics.aalto.fi/multi-fingan/.

Submitted 15 March, 2021; v1 submitted 17 December, 2020; originally announced December 2020.

Comments: Accepted to IEEE Conference on Robotics and Automation 2021 (ICRA). Code is available at https://irobotics.aalto.fi/multi-fingan/

arXiv:2012.04324 [pdf, other]

Continual Adaptation of Visual Representations via Domain Randomization and Meta-learning

Authors: Riccardo Volpi, Diane Larlus, Grégory Rogez

Abstract: Most standard learning approaches lead to fragile models which are prone to drift when sequentially trained on samples of a different nature - the well-known "catastrophic forgetting" issue. In particular, when a model consecutively learns from different visual domains, it tends to forget the past domains in favor of the most recent ones. In this context, we show that one way to learn models that are inherently more robust against forgetting is domain randomization - for vision tasks, randomizing the current domain's distribution with heavy image manipulations. Building on this result, we devise a meta-learning strategy where a regularizer explicitly penalizes any loss associated with transferring the model from the current domain to different "auxiliary" meta-domains, while also easing adaptation to them. Such meta-domains are also generated through randomized image manipulations. We empirically demonstrate in a variety of experiments - spanning from classification to semantic segmentation - that our approach results in models that are less prone to catastrophic forgetting when transferred to new domains.

Submitted 8 April, 2021; v1 submitted 8 December, 2020; originally announced December 2020.

Comments: Accepted to CVPR 2021

arXiv:2012.02743 [pdf, other]

SMPLy Benchmarking 3D Human Pose Estimation in the Wild

Authors: Vincent Leroy, Philippe Weinzaepfel, Romain Brégier, Hadrien Combaluzier, Grégory Rogez

Abstract: Predicting 3D human pose from images has seen great recent improvements. Novel approaches that can even predict both pose and shape from a single input image have been introduced, often relying on a parametric model of the human body such as SMPL. While qualitative results for such methods are often shown for images captured in-the-wild, a proper benchmark in such conditions is still missing, as it is cumbersome to obtain ground-truth 3D poses elsewhere than in a motion capture room. This paper presents a pipeline to easily produce and validate such a dataset with accurate ground-truth, with which we benchmark recent 3D human pose estimation methods in-the-wild. We make use of the recently introduced Mannequin Challenge dataset which contains in-the-wild videos of people frozen in action like statues and leverage the fact that people are static and the camera moving to accurately fit the SMPL model on the sequences. A total of 24,428 frames with registered body models are then selected from 567 scenes at almost no cost, using only online RGB videos. We benchmark state-of-the-art SMPL-based human pose estimation methods on this dataset. Our results highlight that challenges remain, in particular for difficult poses or for scenes where the persons are partially truncated or occluded.

Submitted 4 December, 2020; originally announced December 2020.

Comments: 3DV 2020 Oral presentation

arXiv:2008.09457 [pdf, other]

DOPE: Distillation Of Part Experts for whole-body 3D pose estimation in the wild

Authors: Philippe Weinzaepfel, Romain Brégier, Hadrien Combaluzier, Vincent Leroy, Grégory Rogez

Abstract: We introduce DOPE, the first method to detect and estimate whole-body 3D human poses, including bodies, hands and faces, in the wild. Achieving this level of detail is key for a number of applications that require understanding the interactions of the people with each other or with the environment. The main challenge is the lack of in-the-wild data with labeled whole-body 3D poses. In previous work, training data has been annotated or generated for simpler tasks focusing on bodies, hands or faces separately. In this work, we propose to take advantage of these datasets to train independent experts for each part, namely a body, a hand and a face expert, and distill their knowledge into a single deep network designed for whole-body 2D-3D pose detection. In practice, given a training image with partial or no annotation, each part expert detects its subset of keypoints in 2D and 3D and the resulting estimations are combined to obtain whole-body pseudo ground-truth poses. A distillation loss encourages the whole-body predictions to mimic the experts' outputs. Our results show that this approach significantly outperforms the same whole-body model trained without distillation while staying close to the performance of the experts. Importantly, DOPE is computationally less demanding than the ensemble of experts and can achieve real-time performance. Test code and models are available at https://europe.naverlabs.com/research/computer-vision/dope.

Submitted 21 August, 2020; originally announced August 2020.

Comments: ECCV 2020

arXiv:2006.09081 [pdf, other]

Progressive Skeletonization: Trimming more fat from a network at initialization

Authors: Pau de Jorge, Amartya Sanyal, Harkirat S. Behl, Philip H. S. Torr, Gregory Rogez, Puneet K. Dokania

Abstract: Recent studies have shown that skeletonization (pruning parameters) of networks \textit{at initialization} provides all the practical benefits of sparsity both at inference and training time, while only marginally degrading their performance. However, we observe that beyond a certain level of sparsity (approx $95\%$), these approaches fail to preserve the network performance, and to our surprise, in many cases perform even worse than trivial random pruning. To this end, we propose an objective to find a skeletonized network with maximum {\em foresight connection sensitivity} (FORCE) whereby the trainability, in terms of connection sensitivity, of a pruned network is taken into consideration. We then propose two approximate procedures to maximize our objective (1) Iterative SNIP: allows parameters that were unimportant at earlier stages of skeletonization to become important at later stages; and (2) FORCE: iterative process that allows exploration by allowing already pruned parameters to resurrect at later stages of skeletonization. Empirical analyses on a large suite of experiments show that our approach, while providing at least as good a performance as other recent approaches on moderate pruning levels, provides remarkably improved performance on higher pruning levels (could remove up to $99.5\%$ parameters while keeping the networks trainable). Code can be found in https://github.com/naver/force.

Submitted 19 March, 2021; v1 submitted 16 June, 2020; originally announced June 2020.

arXiv:2003.13764 [pdf, other]

Measuring Generalisation to Unseen Viewpoints, Articulations, Shapes and Objects for 3D Hand Pose Estimation under Hand-Object Interaction

Authors: Anil Armagan, Guillermo Garcia-Hernando, Seungryul Baek, Shreyas Hampali, Mahdi Rad, Zhaohui Zhang, Shipeng Xie, MingXiu Chen, Boshen Zhang, Fu Xiong, Yang Xiao, Zhiguo Cao, Junsong Yuan, Pengfei Ren, Weiting Huang, Haifeng Sun, Marek Hrúz, Jakub Kanis, Zdeněk Krňoul, Qingfu Wan, Shile Li, Linlin Yang, Dongheui Lee, Angela Yao, Weiguo Zhou, et al. (10 additional authors not shown)

Abstract: We study how well different types of approaches generalise in the task of 3D hand pose estimation under single hand scenarios and hand-object interaction. We show that the accuracy of state-of-the-art methods can drop, and that they fail mostly on poses absent from the training set. Unfortunately, since the space of hand poses is highly dimensional, it is inherently not feasible to cover the whole space densely, despite recent efforts in collecting large-scale training datasets. This sampling problem is even more severe when hands are interacting with objects and/or inputs are RGB rather than depth images, as RGB images also vary with lighting conditions and colors. To address these issues, we designed a public challenge (HANDS'19) to evaluate the abilities of current 3D hand pose estimators (HPEs) to interpolate and extrapolate the poses of a training set. More exactly, HANDS'19 is designed (a) to evaluate the influence of both depth and color modalities on 3D hand pose estimation, under the presence or absence of objects; (b) to assess the generalisation abilities w.r.t. four main axes: shapes, articulations, viewpoints, and objects; (c) to explore the use of a synthetic hand model to fill the gaps of current datasets. Through the challenge, the overall accuracy has dramatically improved over the baseline, especially on extrapolation tasks, from 27mm to 13mm mean joint error. Our analyses highlight the impacts of: Data pre-processing, ensemble approaches, the use of a parametric 3D hand model (MANO), and different HPE methods/backbones.

Submitted 10 September, 2020; v1 submitted 30 March, 2020; originally announced March 2020.

Comments: European Conference on Computer Vision (ECCV), 2020

arXiv:1912.07249 [pdf, other]

Mimetics: Towards Understanding Human Actions Out of Context

Authors: Philippe Weinzaepfel, Grégory Rogez

Abstract: Recent methods for video action recognition have reached outstanding performances on existing benchmarks. However, they tend to leverage context such as scenes or objects instead of focusing on understanding the human action itself. For instance, a tennis field leads to the prediction playing tennis irrespectively of the actions performed in the video. In contrast, humans have a more complete understanding of actions and can recognize them without context. The best example of out-of-context actions are mimes, that people can typically recognize despite missing relevant objects and scenes. In this paper, we propose to benchmark action recognition methods in such absence of context and introduce a novel dataset, Mimetics, consisting of mimed actions for a subset of 50 classes from the Kinetics benchmark. Our experiments show that (a) state-of-the-art 3D convolutional neural networks obtain disappointing results on such videos, highlighting the lack of true understanding of the human actions and (b) models leveraging body language via human pose are less prone to context biases. In particular, we show that applying a shallow neural network with a single temporal convolution over body pose features transferred to the action recognition problem performs surprisingly well compared to 3D action recognition methods.

Submitted 2 February, 2021; v1 submitted 16 December, 2019; originally announced December 2019.

arXiv:1908.00439 [pdf, other]

Moulding Humans: Non-parametric 3D Human Shape Estimation from Single Images

Authors: Valentin Gabeur, Jean-Sebastien Franco, Xavier Martin, Cordelia Schmid, Gregory Rogez

Abstract: In this paper, we tackle the problem of 3D human shape estimation from single RGB images. While the recent progress in convolutional neural networks has allowed impressive results for 3D human pose estimation, estimating the full 3D shape of a person is still an open issue. Model-based approaches can output precise meshes of naked under-cloth human bodies but fail to estimate details and un-modelled elements such as hair or clothing. On the other hand, non-parametric volumetric approaches can potentially estimate complete shapes but, in practice, they are limited by the resolution of the output grid and cannot produce detailed estimates. In this work, we propose a non-parametric approach that employs a double depth map to represent the 3D shape of a person: a visible depth map and a "hidden" depth map are estimated and combined, to reconstruct the human 3D shape as done with a "mould". This representation through 2D depth maps allows a higher resolution output with a much lower dimension than voxel-based volumetric representations. Additionally, our fully derivable depth-based model allows us to efficiently incorporate a discriminator in an adversarial fashion to improve the accuracy and "humanness" of the 3D output. We train and quantitatively validate our approach on SURREAL and on 3D-HUMANS, a new photorealistic dataset made of semi-synthetic in-house videos annotated with 3D ground truth surfaces.

Submitted 1 August, 2019; originally announced August 2019.

Comments: Accepted at ICCV 2019

arXiv:1905.07487 [pdf]

doi 10.1002/adfm.201901878

Designing of a magnetodielectric system in hybrid organic-inorganic framework, a perovskite layered phosphonate MnO3PC6H4-m-Br.H2O

Authors: Tathamay Basu, Clarisse Bloyet, Felicien Beaubras, Vincent Caignaert, Olivier Perez, Jean-Michel Rueff, Alain Pautrat, Bernard Raveau, Jean-François Lohier, Paul-Alain Jaffrès, Hélène Couthon, Guillaume Rogez, Grégory Taupier, Honorat Dorkenoo

Abstract: Research on multiferroicity and magnetoelectric coupling in metal-organic systems is rare. Very few hybrid organic-inorganic frameworks (HOIF) exhibit direct magnetoelectric coupling (coupling between spins and dipoles), and these are also restricted to particular COOH-based systems. We show how one can design a hybrid system to obtain such coupling based on the rational design of the organic ligands. The layered phosphonate, MnO3PC6H5.H2O, consisting of perovskite layers stacked with organic phenyl layers, is used as a starting potential candidate. To introduce dipole moment, a closely related metal-phosphonate, MnO3PC6H4-m-Br.H2O is designed. For this purpose, this phosphonate is prepared from 3-bromophenylphosphonic acid that features one electronegative bromine atom directly attached on the aromatic ring in meta position, lowering the symmetry of precursor itself. Thus, bromobenzene moieties in MnO3PC6H4-m-Br.H2O induce a finite dipole moment. This new designed compound exhibits complex magnetism, as observed in layered alkyl chains MnO3PCnH2n+1.H2O materials, namely, 2D magnetic ordering around 20 K followed by weak ferromagnetic ordering below 12 K (T1) with a magnetic field (H)-induced transition around 25 kOe below T1. All these magnetic features are exactly captured in T and H-dependent dielectric constant, epsilon(T) and epsilon(H). This demonstrates direct magnetodielectric coupling in this designed hybrid and yields a new path to tune multiferroic ordering and magnetodielectric coupling.

Submitted 17 May, 2019; originally announced May 2019.

Comments: accepted in Advanced Functional Materials

Journal ref: Adv. Funct. Mater. 2019, 1901878

arXiv:1809.04809 [pdf]

doi 10.1039/C8TC04328K

Incipient spin-dipole coupling in a 1D helical-chain metal-organic hybrid

Authors: Tathamay Basu, Clarisse Bloyet, Jean-Michel Rueff, Vincent Caignaert, Alain Pautrat, Bernard Raveau, Guillaume Rogez, Paul-Alain Jaffrès

Abstract: Low dimensional magnetic systems (such as spin-chain) are extensively studied due to their exotic magnetic properties. Here, we would like to address that such systems should also be interesting in the field of dielectric, ferroelectricity and magnetodielectric coupling. As a prototype example, we have investigated a one-dimensional (1D) helical-chain metal-organic hybrid system with a chiral structure which shows a broad hump in magnetic susceptibility around 55 K (Tmax). The complex dielectric constant exactly traces this feature, which suggests intrinsic magnetodielectric coupling in this chiral system. The dipolar ordering at Tmax occurs due to lattice-distortion which helps to minimize the magnetic energy accompanied by 1D-magnetic ordering or vice-versa. This experimental demonstration initiates a step to design and investigate hybrid organic-inorganic magnetic systems consisting of chiral structure towards ferroelectricity and magnetodielectric coupling.

Submitted 13 September, 2018; originally announced September 2018.

Comments: Manuscript is accepted in J. Mat. Chem. C as a Communication

Journal ref: Journal of Materials Chemistry C 2018

arXiv:1803.00455 [pdf, other]

doi 10.1109/TPAMI.2019.2892985

LCR-Net++: Multi-person 2D and 3D Pose Detection in Natural Images

Authors: Gregory Rogez, Philippe Weinzaepfel, Cordelia Schmid

Abstract: We propose an end-to-end architecture for joint 2D and 3D human pose estimation in natural images. Key to our approach is the generation and scoring of a number of pose proposals per image, which allows us to predict 2D and 3D poses of multiple people simultaneously. Hence, our approach does not require an approximate localization of the humans for initialization. Our Localization-Classification-Regression architecture, named LCR-Net, contains 3 main components: 1) the pose proposal generator that suggests candidate poses at different locations in the image; 2) a classifier that scores the different pose proposals; and 3) a regressor that refines pose proposals both in 2D and 3D. All three stages share the convolutional feature layers and are trained jointly. The final pose estimation is obtained by integrating over neighboring pose hypotheses, which is shown to improve over a standard non maximum suppression algorithm. Our method recovers full-body 2D and 3D poses, hallucinating plausible body parts when the persons are partially occluded or truncated by the image boundary. Our approach significantly outperforms the state of the art in 3D pose estimation on Human3.6M, a controlled environment. Moreover, it shows promising results on real images for both single and multi-person subsets of the MPII 2D pose benchmark and demonstrates satisfying 3D pose results even for multi-person images.

Submitted 13 January, 2019; v1 submitted 1 March, 2018; originally announced March 2018.

Comments: journal version of the CVPR 2017 paper, accepted to appear in IEEE Trans. PAMI

arXiv:1802.04216 [pdf, other]

Image-based Synthesis for Deep 3D Human Pose Estimation

Authors: Grégory Rogez, Cordelia Schmid

Abstract: This paper addresses the problem of 3D human pose estimation in the wild. A significant challenge is the lack of training data, i.e., 2D images of humans annotated with 3D poses. Such data is necessary to train state-of-the-art CNN architectures. Here, we propose a solution to generate a large set of photorealistic synthetic images of humans with 3D pose annotations. We introduce an image-based synthesis engine that artificially augments a dataset of real images with 2D human pose annotations using 3D motion capture data. Given a candidate 3D pose, our algorithm selects for each joint an image whose 2D pose locally matches the projected 3D pose. The selected images are then combined to generate a new synthetic image by stitching local image patches in a kinematically constrained manner. The resulting images are used to train an end-to-end CNN for full-body 3D pose estimation. We cluster the training data into a large number of pose classes and tackle pose estimation as a $K$-way classification problem. Such an approach is viable only with large training sets such as ours. Our method outperforms most of the published works in terms of 3D pose estimation in controlled environments (Human3.6M) and shows promising results for real-world images (LSP). This demonstrates that CNNs trained on artificial images generalize well to real images. Compared to data generated from more classical rendering engines, our synthetic images do not require any domain adaptation or fine-tuning stage.

Submitted 12 February, 2018; originally announced February 2018.

Comments: accepted to appear in IJCV (with minor revisions). Follow-up to NIPS 2016 arXiv:1607.02046

arXiv:1707.06005 [pdf, other]

Detecting Parts for Action Localization

Authors: Nicolas Chesneau, Grégory Rogez, Karteek Alahari, Cordelia Schmid

Abstract: In this paper, we propose a new framework for action localization that tracks people in videos and extracts full-body human tubes, i.e., spatio-temporal regions localizing actions, even in the case of occlusions or truncations. This is achieved by training a novel human part detector that scores visible parts while regressing full-body bounding boxes. The core of our method is a convolutional neural network which learns part proposals specific to certain body parts. These are then combined to detect people robustly in each frame. Our tracking algorithm connects the image detections temporally to extract full-body human tubes. We apply our new tube extraction method on the problem of human action localization, on the popular JHMDB dataset, and a very recent challenging dataset DALY (Daily Action Localization in YouTube), showing state-of-the-art results.

Submitted 21 July, 2017; v1 submitted 19 July, 2017; originally announced July 2017.

Comments: BMVC 2017

arXiv:1607.02046 [pdf, other]

MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild

Authors: Grégory Rogez, Cordelia Schmid

Abstract: This paper addresses the problem of 3D human pose estimation in the wild. A significant challenge is the lack of training data, i.e., 2D images of humans annotated with 3D poses. Such data is necessary to train state-of-the-art CNN architectures. Here, we propose a solution to generate a large set of photorealistic synthetic images of humans with 3D pose annotations. We introduce an image-based synthesis engine that artificially augments a dataset of real images with 2D human pose annotations using 3D Motion Capture (MoCap) data. Given a candidate 3D pose, our algorithm selects for each joint an image whose 2D pose locally matches the projected 3D pose. The selected images are then combined to generate a new synthetic image by stitching local image patches in a kinematically constrained manner. The resulting images are used to train an end-to-end CNN for full-body 3D pose estimation. We cluster the training data into a large number of pose classes and tackle pose estimation as a K-way classification problem. Such an approach is viable only with large training sets such as ours. Our method outperforms the state of the art in terms of 3D pose estimation in controlled environments (Human3.6M) and shows promising results for in-the-wild images (LSP). This demonstrates that CNNs trained on artificial images generalize well to real images.

Submitted 28 October, 2016; v1 submitted 7 July, 2016; originally announced July 2016.

Comments: 9 pages, accepted to appear in NIPS 2016

arXiv:1603.09439 [pdf, other]

The Open World of Micro-Videos

Authors: Phuc Xuan Nguyen, Gregory Rogez, Charless Fowlkes, Deva Ramanan

Abstract: Micro-videos are six-second videos popular on social media networks with several unique properties. Firstly, because of the authoring process, they contain significantly more diversity and narrative structure than existing collections of video "snippets". Secondly, because they are often captured by hand-held mobile cameras, they contain specialized viewpoints including third-person, egocentric, and self-facing views seldom seen in traditional produced video. Thirdly, due to their continuous production and publication on social networks, aggregate micro-video content contains interesting open-world dynamics that reflect the temporal evolution of tag topics. These aspects make micro-videos an appealing well of visual data for developing large-scale models for video understanding. We analyze a novel dataset of micro-videos labeled with 58 thousand tags. To analyze this data, we introduce viewpoint-specific and temporally-evolving models for video understanding, defined over state-of-the-art motion and deep visual features. We conclude that our dataset opens up new research opportunities for large-scale video analysis, novel viewpoints, and open-world dynamics.

Submitted 31 March, 2016; v1 submitted 30 March, 2016; originally announced March 2016.

arXiv:1504.06378 [pdf, other]

Depth-based hand pose estimation: methods, data, and challenges

Authors: James Steven Supancic III, Gregory Rogez, Yi Yang, Jamie Shotton, Deva Ramanan

Abstract: Hand pose estimation has matured rapidly in recent years. The introduction of commodity depth sensors and a multitude of practical applications have spurred new advances. We provide an extensive analysis of the state of the art, focusing on hand pose estimation from a single depth frame. To do so, we have implemented a considerable number of systems, and will release all software and evaluation code. We summarize important conclusions here: (1) Pose estimation appears roughly solved for scenes with isolated hands. However, methods still struggle to analyze cluttered scenes where hands may be interacting with nearby objects and surfaces. To spur further progress we introduce a challenging new dataset with diverse, cluttered scenes. (2) Many methods evaluate themselves with disparate criteria, making comparisons difficult. We define consistent evaluation criteria, rigorously motivated by human experiments. (3) We introduce a simple nearest-neighbor baseline that outperforms most existing systems. This implies that most systems do not generalize beyond their training sets. This also reinforces the under-appreciated point that training data is as important as the model itself. We conclude with directions for future progress.

Submitted 6 May, 2015; v1 submitted 23 April, 2015; originally announced April 2015.

arXiv:1412.0065 [pdf, other]

3D Hand Pose Detection in Egocentric RGB-D Images

Authors: Gregory Rogez, James S. Supancic III, Maryam Khademi, Jose Maria Martinez Montiel, Deva Ramanan

Abstract: We focus on the task of everyday hand pose estimation from egocentric viewpoints. For this task, we show that depth sensors are particularly informative for extracting near-field interactions of the camera wearer with his/her environment. Despite the recent advances in full-body pose estimation using Kinect-like sensors, reliable monocular hand pose estimation in RGB-D images is still an unsolved problem. The problem is considerably exacerbated when analyzing hands performing daily activities from a first-person viewpoint, due to severe occlusions arising from object manipulations and a limited field-of-view. Our system addresses these difficulties by exploiting strong priors over viewpoint and pose in a discriminative tracking-by-detection framework. Our priors are operationalized through a photorealistic synthetic model of egocentric scenes, which is used to generate training data for learning depth-based pose classifiers. We evaluate our approach on an annotated dataset of real egocentric object manipulation scenes and compare to both commercial and academic approaches. Our method provides state-of-the-art performance for both hand detection and pose estimation in egocentric RGB-D images.

Submitted 28 November, 2014; originally announced December 2014.

Comments: 14 pages, 15 figures, extended version of the corresponding ECCV workshop paper, submitted to International Journal of Computer Vision

arXiv:1412.0060 [pdf, other]

Egocentric Pose Recognition in Four Lines of Code

Authors: Gregory Rogez, James S. Supancic III, Deva Ramanan

Abstract: We tackle the problem of estimating the 3D pose of an individual's upper limbs (arms+hands) from a chest-mounted depth camera. Importantly, we consider pose estimation during everyday interactions with objects. Past work shows that strong pose+viewpoint priors and depth-based features are crucial for robust performance. In egocentric views, hands and arms are observable within a well-defined volume in front of the camera. We call this volume an egocentric workspace. A notable property is that hand appearance correlates with workspace location. To exploit this correlation, we classify arm+hand configurations in a global egocentric coordinate frame, rather than a local scanning window. This greatly simplifies the architecture and improves performance. We propose an efficient pipeline which 1) generates synthetic workspace exemplars for training using a virtual chest-mounted camera whose intrinsic parameters match our physical camera, 2) computes perspective-aware depth features on this entire volume and 3) recognizes discrete arm+hand pose classes through a sparse multi-class SVM. Our method provides state-of-the-art hand pose recognition performance from egocentric RGB-D images in real-time.

Submitted 28 November, 2014; originally announced December 2014.

Comments: 9 pages, 10 figures

arXiv:0908.0607 [pdf]

doi 10.1063/1.3192355

Study of molecular spin-crossover complex Fe(phen)2(NCS)2 thin films

Authors: Shengwei Shi, G. Schmerber, J. Arabski, J. -B. Beaufrand, D. J. Kim, S. Boukari, M. Bowen, N. T. Kemp, N. Viart, G. Rogez, E. Beaurepaire, H. Aubriet, J. Petersen, C. Becker, D. Ruch

Abstract: We report on the growth by evaporation under high vacuum of high-quality thin films of Fe(phen)2(NCS)2 (phen=1,10-phenanthroline) that maintain the expected electronic structure down to a thickness of 10 nm and that exhibit a temperature-driven spin transition. We have investigated the current-voltage characteristics of a device based on such films. From the space charge-limited current regime, we deduce a mobility of $6.5 \times 10^{-6}$ cm$^2$/(V·s) that is similar to the low-range mobility measured on the widely studied tris(8-hydroxyquinoline)aluminium organic semiconductor. This work paves the way for multifunctional molecular devices based on spin-crossover complexes.

Submitted 5 August, 2009; originally announced August 2009.

Journal ref: Appl. Phys. Lett. 95, 043303 (2009)

Showing 1–37 of 37 results for author: Rogez, G