Search | arXiv e-print repository

TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation

Authors: Sai Kumar Dwivedi, Yu Sun, Priyanka Patel, Yao Feng, Michael J. Black

Abstract: We address the problem of regressing 3D human pose and shape from a single image, with a focus on 3D accuracy. The current best methods leverage large datasets of 3D pseudo-ground-truth (p-GT) and 2D keypoints, leading to robust performance. With such methods, we observe a paradoxical decline in 3D pose accuracy with increasing 2D accuracy. This is caused by biases in the p-GT and the use of an ap… ▽ More We address the problem of regressing 3D human pose and shape from a single image, with a focus on 3D accuracy. The current best methods leverage large datasets of 3D pseudo-ground-truth (p-GT) and 2D keypoints, leading to robust performance. With such methods, we observe a paradoxical decline in 3D pose accuracy with increasing 2D accuracy. This is caused by biases in the p-GT and the use of an approximate camera projection model. We quantify the error induced by current camera models and show that fitting 2D keypoints and p-GT accurately causes incorrect 3D poses. Our analysis defines the invalid distances within which minimizing 2D and p-GT losses is detrimental. We use this to formulate a new loss Threshold-Adaptive Loss Scaling (TALS) that penalizes gross 2D and p-GT losses but not smaller ones. With such a loss, there are many 3D poses that could equally explain the 2D evidence. To reduce this ambiguity we need a prior over valid human poses but such priors can introduce unwanted bias. To address this, we exploit a tokenized representation of human pose and reformulate the problem as token prediction. This restricts the estimated poses to the space of valid poses, effectively providing a uniform prior. Extensive experiments on the EMDB and 3DPW datasets show that our reformulated keypoint loss and tokenization allows us to train on in-the-wild data while improving 3D accuracy over the state-of-the-art. Our models and code are available for research at https://tokenhmr.is.tue.mpg.de. △ Less

Submitted 25 April, 2024; originally announced April 2024.

Comments: CVPR 2024

arXiv:2311.18836 [pdf, other]

ChatPose: Chatting about 3D Human Pose

Authors: Yao Feng, **g Lin, Sai Kumar Dwivedi, Yu Sun, Priyanka Patel, Michael J. Black

Abstract: We introduce ChatPose, a framework employing Large Language Models (LLMs) to understand and reason about 3D human poses from images or textual descriptions. Our work is motivated by the human ability to intuitively understand postures from a single image or a brief description, a process that intertwines image interpretation, world knowledge, and an understanding of body language. Traditional huma… ▽ More We introduce ChatPose, a framework employing Large Language Models (LLMs) to understand and reason about 3D human poses from images or textual descriptions. Our work is motivated by the human ability to intuitively understand postures from a single image or a brief description, a process that intertwines image interpretation, world knowledge, and an understanding of body language. Traditional human pose estimation and generation methods often operate in isolation, lacking semantic understanding and reasoning abilities. ChatPose addresses these limitations by embedding SMPL poses as distinct signal tokens within a multimodal LLM, enabling the direct generation of 3D body poses from both textual and visual inputs. Leveraging the powerful capabilities of multimodal LLMs, ChatPose unifies classical 3D human pose and generation tasks while offering user interactions. Additionally, ChatPose empowers LLMs to apply their extensive world knowledge in reasoning about human poses, leading to two advanced tasks: speculative pose generation and reasoning about pose estimation. These tasks involve reasoning about humans to generate 3D poses from subtle text queries, possibly accompanied by images. We establish benchmarks for these tasks, moving beyond traditional 3D pose generation and estimation methods. Our results show that ChatPose outperforms existing multimodal LLMs and task-specific methods on these newly proposed tasks. Furthermore, ChatPose's ability to understand and generate 3D human poses based on complex reasoning opens new directions in human pose analysis. △ Less

Submitted 23 April, 2024; v1 submitted 30 November, 2023; originally announced November 2023.

Comments: Home page: https://yfeng95.github.io/ChatPose/

arXiv:2308.12965 [pdf, other]

POCO: 3D Pose and Shape Estimation with Confidence

Authors: Sai Kumar Dwivedi, Cordelia Schmid, Hongwei Yi, Michael J. Black, Dimitrios Tzionas

Abstract: The regression of 3D Human Pose and Shape (HPS) from an image is becoming increasingly accurate. This makes the results useful for downstream tasks like human action recognition or 3D graphics. Yet, no regressor is perfect, and accuracy can be affected by ambiguous image evidence or by poses and appearance that are unseen during training. Most current HPS regressors, however, do not report the con… ▽ More The regression of 3D Human Pose and Shape (HPS) from an image is becoming increasingly accurate. This makes the results useful for downstream tasks like human action recognition or 3D graphics. Yet, no regressor is perfect, and accuracy can be affected by ambiguous image evidence or by poses and appearance that are unseen during training. Most current HPS regressors, however, do not report the confidence of their outputs, meaning that downstream tasks cannot differentiate accurate estimates from inaccurate ones. To address this, we develop POCO, a novel framework for training HPS regressors to estimate not only a 3D human body, but also their confidence, in a single feed-forward pass. Specifically, POCO estimates both the 3D body pose and a per-sample variance. The key idea is to introduce a Dual Conditioning Strategy (DCS) for regressing uncertainty that is highly correlated to pose reconstruction quality. The POCO framework can be applied to any HPS regressor and here we evaluate it by modifying HMR, PARE, and CLIFF. In all cases, training the network to reason about uncertainty helps it learn to more accurately estimate 3D pose. While this was not our goal, the improvement is modest but consistent. Our main motivation is to provide uncertainty estimates for downstream tasks; we demonstrate this in two ways: (1) We use the confidence estimates to bootstrap HPS training. Given unlabelled image data, we take the confident estimates of a POCO-trained regressor as pseudo ground truth. Retraining with this automatically-curated data improves accuracy. (2) We exploit uncertainty in video pose estimation by automatically identifying uncertain frames (e.g. due to occlusion) and inpainting these from confident frames. Code and models will be available for research at https://poco.is.tue.mpg.de. △ Less

Submitted 24 August, 2023; originally announced August 2023.

arXiv:2303.03373 [pdf, other]

Detecting Human-Object Contact in Images

Authors: Yixin Chen, Sai Kumar Dwivedi, Michael J. Black, Dimitrios Tzionas

Abstract: Humans constantly contact objects to move and perform tasks. Thus, detecting human-object contact is important for building human-centered artificial intelligence. However, there exists no robust method to detect contact between the body and the scene from an image, and there exists no dataset to learn such a detector. We fill this gap with HOT ("Human-Object conTact"), a new dataset of human-obje… ▽ More Humans constantly contact objects to move and perform tasks. Thus, detecting human-object contact is important for building human-centered artificial intelligence. However, there exists no robust method to detect contact between the body and the scene from an image, and there exists no dataset to learn such a detector. We fill this gap with HOT ("Human-Object conTact"), a new dataset of human-object contacts for images. To build HOT, we use two data sources: (1) We use the PROX dataset of 3D human meshes moving in 3D scenes, and automatically annotate 2D image areas for contact via 3D mesh proximity and projection. (2) We use the V-COCO, HAKE and Watch-n-Patch datasets, and ask trained annotators to draw polygons for the 2D image areas where contact takes place. We also annotate the involved body part of the human body. We use our HOT dataset to train a new contact detector, which takes a single color image as input, and outputs 2D contact heatmaps as well as the body-part labels that are in contact. This is a new and challenging task that extends current foot-ground or hand-object contact detectors to the full generality of the whole body. The detector uses a part-attention branch to guide contact estimation through the context of the surrounding body parts and scene. We evaluate our detector extensively, and quantitative results show that our model outperforms baselines, and that all components contribute to better performance. Results on images from an online repository show reasonable detections and generalizability. △ Less

Submitted 4 April, 2023; v1 submitted 6 March, 2023; originally announced March 2023.

Comments: Accepted at CVPR 2023

arXiv:2110.03480 [pdf, other]

Learning to Regress Bodies from Images using Differentiable Semantic Rendering

Authors: Sai Kumar Dwivedi, Nikos Athanasiou, Muhammed Kocabas, Michael J. Black

Abstract: Learning to regress 3D human body shape and pose (e.g.~SMPL parameters) from monocular images typically exploits losses on 2D keypoints, silhouettes, and/or part-segmentation when 3D training data is not available. Such losses, however, are limited because 2D keypoints do not supervise body shape and segmentations of people in clothing do not match projected minimally-clothed SMPL shapes. To explo… ▽ More Learning to regress 3D human body shape and pose (e.g.~SMPL parameters) from monocular images typically exploits losses on 2D keypoints, silhouettes, and/or part-segmentation when 3D training data is not available. Such losses, however, are limited because 2D keypoints do not supervise body shape and segmentations of people in clothing do not match projected minimally-clothed SMPL shapes. To exploit richer image information about clothed people, we introduce higher-level semantic information about clothing to penalize clothed and non-clothed regions of the image differently. To do so, we train a body regressor using a novel Differentiable Semantic Rendering - DSR loss. For Minimally-Clothed regions, we define the DSR-MC loss, which encourages a tight match between a rendered SMPL body and the minimally-clothed regions of the image. For clothed regions, we define the DSR-C loss to encourage the rendered SMPL body to be inside the clothing mask. To ensure end-to-end differentiable training, we learn a semantic clothing prior for SMPL vertices from thousands of clothed human scans. We perform extensive qualitative and quantitative experiments to evaluate the role of clothing semantics on the accuracy of 3D human pose and shape estimation. We outperform all previous state-of-the-art methods on 3DPW and Human3.6M and obtain on par results on MPI-INF-3DHP. Code and trained models are available for research at https://dsr.is.tue.mpg.de/. △ Less

Submitted 23 February, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

Comments: ICCV2021

arXiv:1909.07945 [pdf, other]

ProtoGAN: Towards Few Shot Learning for Action Recognition

Authors: Sai Kumar Dwivedi, Vikram Gupta, Rahul Mitra, Shuaib Ahmed, Arjun Jain

Abstract: Few-shot learning (FSL) for action recognition is a challenging task of recognizing novel action categories which are represented by few instances in the training data. In a more generalized FSL setting (G-FSL), both seen as well as novel action categories need to be recognized. Conventional classifiers suffer due to inadequate data in FSL setting and inherent bias towards seen action categories i… ▽ More Few-shot learning (FSL) for action recognition is a challenging task of recognizing novel action categories which are represented by few instances in the training data. In a more generalized FSL setting (G-FSL), both seen as well as novel action categories need to be recognized. Conventional classifiers suffer due to inadequate data in FSL setting and inherent bias towards seen action categories in G-FSL setting. In this paper, we address this problem by proposing a novel ProtoGAN framework which synthesizes additional examples for novel categories by conditioning a conditional generative adversarial network with class prototype vectors. These class prototype vectors are learnt using a Class Prototype Transfer Network (CPTN) from examples of seen categories. Our synthesized examples for a novel class are semantically similar to real examples belonging to that class and is used to train a model exhibiting better generalization towards novel classes. We support our claim by performing extensive experiments on three datasets: UCF101, HMDB51 and Olympic-Sports. To the best of our knowledge, we are the first to report the results for G-FSL and provide a strong benchmark for future research. We also outperform the state-of-the-art method in FSL for all the aforementioned datasets. △ Less

Submitted 17 September, 2019; originally announced September 2019.

Comments: 9 pages, 5 tables, 2 figures. To appear in the proceedings of ICCV Workshop 2019

arXiv:1909.06672 [pdf, other]

Progression Modelling for Online and Early Gesture Detection

Authors: Vikram Gupta, Sai Kumar Dwivedi, Rishabh Dabral, Arjun Jain

Abstract: Online and Early detection of gestures is crucial for building touchless gesture based interfaces. These interfaces should operate on a stream of video frames instead of the complete video and detect the presence of gestures at an earlier stage than post-completion for providing real time user experience. To achieve this, it is important to recognize the progression of the gesture across different… ▽ More Online and Early detection of gestures is crucial for building touchless gesture based interfaces. These interfaces should operate on a stream of video frames instead of the complete video and detect the presence of gestures at an earlier stage than post-completion for providing real time user experience. To achieve this, it is important to recognize the progression of the gesture across different stages so that appropriate responses can be triggered on reaching the desired execution stage. To address this, we propose a simple yet effective multi-task learning framework which models the progression of the gesture along with frame level recognition. The proposed framework recognizes the gestures at an early stage with high precision and also achieves state-of-the-art recognition accuracy of 87.8% which is closer to human accuracy of 88.4% on the NVIDIA gesture dataset in the offline configuration and advances the state-of-the-art by more than 4%. We also introduce tightly segmented annotations for the NVIDIA gesture dataset and setup a strong baseline for gesture localization for this dataset. We also evaluate our framework on the Montalbano dataset and report competitive results. △ Less

Submitted 14 September, 2019; originally announced September 2019.

Comments: 3DV 2019 Oral paper

arXiv:1407.2423 [pdf]

Desiging a logical security framework for e-commerce system based on soa

Authors: Ashish Kr. Luhach, Sanjay K. Dwivedi, C. K. Jha

Abstract: Rapid increases in information technology also changed the existing markets and transformed them into e- markets (e-commerce) from physical markets. Equally with the e-commerce evolution, enterprises have to recover a safer approach for implementing E-commerce and maintaining its logical security. SOA is one of the best techniques to fulfill these requirements. SOA holds the vantage of being easy… ▽ More Rapid increases in information technology also changed the existing markets and transformed them into e- markets (e-commerce) from physical markets. Equally with the e-commerce evolution, enterprises have to recover a safer approach for implementing E-commerce and maintaining its logical security. SOA is one of the best techniques to fulfill these requirements. SOA holds the vantage of being easy to use, flexible, and recyclable. With the advantages, SOA is also endowed with ease for message tampering and unauthorized access. This causes the security technology implementation of E-commerce very difficult at other engineering sciences. This paper discusses the importance of using SOA in E-commerce and identifies the flaws in the existing security analysis of E-commerce platforms. On the foundation of identifying defects, this editorial also suggested an implementation design of the logical security framework for SOA supported E-commerce system. △ Less

Submitted 9 July, 2014; originally announced July 2014.

arXiv:1407.2421 [pdf]

Designing and implementing the logical security framework for e-commerce based on service oriented architecture

Authors: Ashish Kr. Luhach, Sanjay K Dwivedi, C K Jha

Abstract: Rapid evolution of information technology has contributed to the evolution of more sophisticated E- commerce system with the better transaction time and protection. The currently used E-commerce models lack in quality properties such as logical security because of their poor designing and to face the highly equipped and trained intruders. This editorial proposed a security framework for small and… ▽ More Rapid evolution of information technology has contributed to the evolution of more sophisticated E- commerce system with the better transaction time and protection. The currently used E-commerce models lack in quality properties such as logical security because of their poor designing and to face the highly equipped and trained intruders. This editorial proposed a security framework for small and medium sized E-commerce, based on service oriented architecture and gives an analysis of the eminent security attacks which can be averted. The proposed security framework will be implemented and validated on an open source E-commerce, and the results achieved so far are also presented. △ Less

Submitted 9 July, 2014; originally announced July 2014.

arXiv:1406.4607 [pdf, ps, other]

doi 10.1371/journal.pone.0088249

Uncovering Randomness and Success in Society

Authors: Sarika Jalan, Camellia Sarkar, Anagha Madhusudanan, Sanjiv Kumar Dwivedi

Abstract: An understanding of how individuals shape and impact the evolution of society is vastly limited due to the unavailability of large-scale reliable datasets that can simultaneously capture information regarding individual movements and social interactions. We believe that the popular Indian film industry, 'Bollywood', can provide a social network apt for such a study. Bollywood provides massive amou… ▽ More An understanding of how individuals shape and impact the evolution of society is vastly limited due to the unavailability of large-scale reliable datasets that can simultaneously capture information regarding individual movements and social interactions. We believe that the popular Indian film industry, 'Bollywood', can provide a social network apt for such a study. Bollywood provides massive amounts of real, unbiased data that spans more than 100 years, and hence this network has been used as a model for the present paper. The nodes which maintain a moderate degree or widely cooperate with the other nodes of the network tend to be more fit (measured as the success of the node in the industry) in comparison to the other nodes. The analysis carried forth in the current work, using a conjoined framework of complex network theory and random matrix theory, aims to quantify the elements that determine the fitness of an individual node and the factors that contribute to the robustness of a network. The authors of this paper believe that the method of study used in the current paper can be extended to study various other industries and organizations. △ Less

Submitted 8 July, 2014; v1 submitted 18 June, 2014; originally announced June 2014.

Comments: 39 pages, 12 figures, 14 tables

Journal ref: PloS one, 9(2), e88249 (2014)

arXiv:1111.5293 [pdf]

Rule based Part of speech Tagger for Homoeopathy Clinical realm

Authors: Sanjay K. Dwivedi, Pramod P. Sukhadeve

Abstract: A tagger is a mandatory segment of most text scrutiny systems, as it consigned a s yntax class (e.g., noun, verb, adjective, and adverb) to every word in a sentence. In this paper, we present a simple part of speech tagger for homoeopathy clinical language. This paper reports about the anticipated part of speech tagger for homoeopathy clinical language. It exploit standard pattern for evaluating s… ▽ More A tagger is a mandatory segment of most text scrutiny systems, as it consigned a s yntax class (e.g., noun, verb, adjective, and adverb) to every word in a sentence. In this paper, we present a simple part of speech tagger for homoeopathy clinical language. This paper reports about the anticipated part of speech tagger for homoeopathy clinical language. It exploit standard pattern for evaluating sentences, untagged clinical corpus of 20085 words is used, from which we had selected 125 sentences (2322 tokens). The problem of tagging in natural language processing is to find a way to tag every word in a text as a meticulous part of speech. The basic idea is to apply a set of rules on clinical sentences and on each word, Accuracy is the leading factor in evaluating any POS tagger so the accuracy of proposed tagger is also conversed. △ Less

Submitted 13 November, 2011; originally announced November 2011.

Showing 1–11 of 11 results for author: Dwivedi, S K