Search | arXiv e-print repository

UPose3D: Uncertainty-Aware 3D Human Pose Estimation with Cross-View and Temporal Cues

Authors: Vandad Davoodnia, Saeed Ghorbani, Marc-André Carbonneau, Alexandre Messier, Ali Etemad

Abstract: We introduce UPose3D, a novel approach for multi-view 3D human pose estimation, addressing challenges in accuracy and scalability. Our method advances existing pose estimation frameworks by improving robustness and flexibility without requiring direct 3D annotations. At the core of our method, a pose compiler module refines predictions from a 2D keypoints estimator that operates on a single image by leveraging temporal and cross-view information. Our novel cross-view fusion strategy is scalable to any number of cameras, while our synthetic data generation strategy ensures generalization across diverse actors, scenes, and viewpoints. Finally, UPose3D leverages the prediction uncertainty of both the 2D keypoint estimator and the pose compiler module. This provides robustness to outliers and noisy data, resulting in state-of-the-art performance in out-of-distribution settings. In addition, for in-distribution settings, UPose3D yields a performance rivaling methods that rely on 3D annotated data, while being the state-of-the-art among methods relying only on 2D supervision.

Submitted 14 May, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

Comments: 18 pages, 12 figures

arXiv:2404.12625 [pdf, other]

SkelFormer: Markerless 3D Pose and Shape Estimation using Skeletal Transformers

Authors: Vandad Davoodnia, Saeed Ghorbani, Alexandre Messier, Ali Etemad

Abstract: We introduce SkelFormer, a novel markerless motion capture pipeline for multi-view human pose and shape estimation. Our method first uses off-the-shelf 2D keypoint estimators, pre-trained on large-scale in-the-wild data, to obtain 3D joint positions. Next, we design a regression-based inverse-kinematic skeletal transformer that maps the joint positions to pose and shape representations from heavily noisy observations. This module integrates prior knowledge about pose space and infers the full pose state at runtime. Separating the 3D keypoint detection and inverse-kinematic problems, along with the expressive representations learned by our skeletal transformer, enhances the generalization of our method to unseen noisy data. We evaluate our method on three public datasets in both in-distribution and out-of-distribution settings, and observe strong performance with respect to prior works. Moreover, ablation experiments demonstrate the impact of each of the modules of our architecture. Finally, we study the performance of our method in dealing with noise and heavy occlusions and find considerable robustness with respect to other solutions.

Submitted 19 April, 2024; originally announced April 2024.

Comments: 12 pages, 8 figures

arXiv:2209.07556 [pdf, other]

ZeroEGGS: Zero-shot Example-based Gesture Generation from Speech

Authors: Saeed Ghorbani, Ylva Ferstl, Daniel Holden, Nikolaus F. Troje, Marc-André Carbonneau

Abstract: We present ZeroEGGS, a neural network framework for speech-driven gesture generation with zero-shot style control by example. This means style can be controlled via only a short example motion clip, even for motion styles unseen during training. Our model uses a Variational framework to learn a style embedding, making it easy to modify style through latent space manipulation or blending and scaling of style embeddings. The probabilistic nature of our framework further enables the generation of a variety of outputs given the same input, addressing the stochastic nature of gesture motion. In a series of experiments, we first demonstrate the flexibility and generalizability of our model to new speakers and styles. In a user study, we then show that our model outperforms previous state-of-the-art techniques in naturalness of motion, appropriateness for speech, and style portrayal. Finally, we release a high-quality dataset of full-body gesture motion including fingers, with speech, spanning across 19 different styles.

Submitted 23 September, 2022; v1 submitted 15 September, 2022; originally announced September 2022.

arXiv:2206.06518 [pdf, other]

doi 10.1007/s10489-021-02418-y

Estimating Pose from Pressure Data for Smart Beds with Deep Image-based Pose Estimators

Authors: Vandad Davoodnia, Saeed Ghorbani, Ali Etemad

Abstract: In-bed pose estimation has shown value in fields such as hospital patient monitoring, sleep studies, and smart homes. In this paper, we explore different strategies for detecting body pose from highly ambiguous pressure data, with the aid of pre-existing pose estimators. We examine the performance of pre-trained pose estimators by using them either directly or by re-training them on two pressure datasets. We also explore other strategies utilizing a learnable pre-processing domain adaptation step, which transforms the vague pressure maps to a representation closer to the expected input space of common purpose pose estimation modules. Accordingly, we used a fully convolutional network with multiple scales to provide the pose-specific characteristics of the pressure maps to the pre-trained pose estimation module. Our complete analysis of different approaches shows that the combination of a learnable pre-processing module with re-training pre-existing image-based pose estimators on the pressure data is able to overcome issues such as highly vague pressure points to achieve very high pose estimation accuracy.

Submitted 13 June, 2022; originally announced June 2022.

Comments: The version of record of this article, first published in Applied Intelligence, is available online at Publisher's website https://doi.org/10.1007/s10489-021-02418-y. arXiv admin note: substantial text overlap with arXiv:1908.08919

Report number: 1573-7497

Journal ref: Applied Intelligence (2021): 1-15

arXiv:2011.04084 [pdf, other]

Listen, Look and Deliberate: Visual context-aware speech recognition using pre-trained text-video representations

Authors: Shahram Ghorbani, Yashesh Gaur, Yu Shi, Jinyu Li

Abstract: In this study, we try to address the problem of leveraging visual signals to improve Automatic Speech Recognition (ASR), also known as visual context-aware ASR (VC-ASR). We explore novel VC-ASR approaches to leverage video and text representations extracted by a self-supervised pre-trained text-video embedding model. Firstly, we propose a multi-stream attention architecture to leverage signals from both audio and video modalities. This architecture consists of separate encoders for the two modalities and a single decoder that attends over them. We show that this architecture is better than fusing modalities at the signal level. Additionally, we also explore leveraging the visual information in a second pass model, which has also been referred to as a 'deliberation model'. The deliberation model accepts audio representations and text hypotheses from the first pass ASR and combines them with a visual stream for an improved visual context-aware recognition. The proposed deliberation scheme can work on top of any well trained ASR and also enabled us to leverage the pre-trained text model to ground the hypotheses with the visual features. Our experiments on HOW2 dataset show that multi-stream and deliberation architectures are very effective at the VC-ASR task. We evaluate the proposed models for two scenarios: a clean audio stream, and distorted audio in which we mask out some specific words. The deliberation model outperforms the multi-stream model and achieves a relative WER improvement of 6% and 8.7% for the clean and masked data, respectively, compared to an audio-only model. The deliberation model also improves recovering the masked words by 59% relative.

Submitted 8 November, 2020; originally announced November 2020.

Comments: Accepted at SLT 2021

arXiv:2010.09950 [pdf, other]

doi 10.1111/cgf.14116

Probabilistic Character Motion Synthesis using a Hierarchical Deep Latent Variable Model

Authors: Saeed Ghorbani, Calden Wloka, Ali Etemad, Marcus A. Brubaker, Nikolaus F. Troje

Abstract: We present a probabilistic framework to generate character animations based on weak control signals, such that the synthesized motions are realistic while retaining the stochastic nature of human movement. The proposed architecture, which is designed as a hierarchical recurrent model, maps each sub-sequence of motions into a stochastic latent code using a variational autoencoder extended over the temporal domain. We also propose an objective function which respects the impact of each joint on the pose and compares the joint angles based on angular distance. We use two novel quantitative protocols and human qualitative assessment to demonstrate the ability of our model to generate convincing and diverse periodic and non-periodic motion sequences without the need for strong control signals.

Submitted 19 October, 2020; originally announced October 2020.

Journal ref: Computer Graphics Forum, 39 (2020), Issue 8

arXiv:2010.09084 [pdf, other]

Gait Recognition using Multi-Scale Partial Representation Transformation with Capsules

Authors: Alireza Sepas-Moghaddam, Saeed Ghorbani, Nikolaus F. Troje, Ali Etemad

Abstract: Gait recognition, referring to the identification of individuals based on the manner in which they walk, can be very challenging due to the variations in the viewpoint of the camera and the appearance of individuals. Current methods for gait recognition have been dominated by deep learning models, notably those based on partial feature representations. In this context, we propose a novel deep network, learning to transfer multi-scale partial gait representations using capsules to obtain more discriminative gait features. Our network first obtains multi-scale partial representations using a state-of-the-art deep partial feature extractor. It then recurrently learns the correlations and co-occurrences of the patterns among the partial features in forward and backward directions using Bi-directional Gated Recurrent Units (BGRU). Finally, a capsule network is adopted to learn deeper part-whole relationships and assigns more weights to the more relevant features while ignoring the spurious dimensions. That way, we obtain final features that are more robust to both viewing and appearance changes. The performance of our method has been extensively tested on two gait recognition datasets, CASIA-B and OU-MVLP, using four challenging test protocols. The results of our method have been compared to the state-of-the-art gait recognition solutions, showing the superiority of our model, notably when facing challenging viewing and carrying conditions.

Submitted 18 October, 2020; originally announced October 2020.

Comments: Accepted to International Conference on Pattern Recognition (ICPR) 2020

arXiv:2007.09131 [pdf, other]

doi 10.21437/Interspeech.2020-2048

SkipConvNet: Skip Convolutional Neural Network for Speech Dereverberation using Optimally Smoothed Spectral Mapping

Authors: Vinay Kothapally, Wei Xia, Shahram Ghorbani, John H. L. Hansen, Wei Xue, Jing Huang

Abstract: The reliability of using fully convolutional networks (FCNs) has been successfully demonstrated by recent studies in many speech applications. One of the most popular variants of these FCNs is the 'U-Net', which is an encoder-decoder network with skip connections. In this study, we propose 'SkipConvNet', where we replace each skip connection with multiple convolutional modules to provide the decoder with intuitive feature maps rather than the encoder's output, to improve the learning capacity of the network. We also propose the use of optimal smoothing of power spectral density (PSD) as a pre-processing step, which helps to further enhance the efficiency of the network. To evaluate our proposed system, we use the REVERB challenge corpus to assess the performance of various enhancement approaches under the same conditions. We focus solely on monitoring improvements in speech quality and their contribution to improving the efficiency of back-end speech systems, such as speech recognition and speaker verification, trained on only clean speech. Experimental findings show that the proposed system consistently outperforms other approaches.

Submitted 17 July, 2020; originally announced July 2020.

Comments: Submitted to Interspeech2020

arXiv:2003.01888 [pdf, other]

doi 10.1371/journal.pone.0253157

MoVi: A Large Multipurpose Motion and Video Dataset

Authors: Saeed Ghorbani, Kimia Mahdaviani, Anne Thaler, Konrad Kording, Douglas James Cook, Gunnar Blohm, Nikolaus F. Troje

Abstract: Human movements are both an area of intense study and the basis of many applications such as character animation. For many applications, it is crucial to identify movements from videos or analyze datasets of movements. Here we introduce a new human Motion and Video dataset MoVi, which we make available publicly. It contains 60 female and 30 male actors performing a collection of 20 predefined everyday actions and sports movements, and one self-chosen movement. In five capture rounds, the same actors and movements were recorded using different hardware systems, including an optical motion capture system, video cameras, and inertial measurement units (IMU). For some of the capture rounds, the actors were recorded when wearing natural clothing; for the other rounds they wore minimal clothing. In total, our dataset contains 9 hours of motion capture data, 17 hours of video data from 4 different points of view (including one hand-held camera), and 6.6 hours of IMU data. In this paper, we describe how the dataset was collected and post-processed; we present state-of-the-art estimates of skeletal motions and full-body shape deformations associated with skeletal motion. We discuss examples for potential studies this dataset could enable.

Submitted 3 March, 2020; originally announced March 2020.

arXiv:2001.01656 [pdf, other]

Audio-visual Recognition of Overlapped speech for the LRS2 dataset

Authors: Jianwei Yu, Shi-Xiong Zhang, Jian Wu, Shahram Ghorbani, Bo Wu, Shiyin Kang, Shansong Liu, Xunying Liu, Helen Meng, Dong Yu

Abstract: Automatic recognition of overlapped speech remains a highly challenging task to date. Motivated by the bimodal nature of human speech perception, this paper investigates the use of audio-visual technologies for overlapped speech recognition. Three issues associated with the construction of audio-visual speech recognition (AVSR) systems are addressed. First, the basic architecture designs i.e. end-to-end and hybrid of AVSR systems are investigated. Second, purposefully designed modality fusion gates are used to robustly integrate the audio and visual features. Third, in contrast to a traditional pipelined architecture containing explicit speech separation and recognition components, a streamlined and integrated AVSR system optimized consistently using the lattice-free MMI (LF-MMI) discriminative criterion is also proposed. The proposed LF-MMI time-delay neural network (TDNN) system establishes the state-of-the-art for the LRS2 dataset. Experiments on overlapped speech simulated from the LRS2 dataset suggest the proposed AVSR system outperformed the audio only baseline LF-MMI DNN system by up to 29.98% absolute in word error rate (WER) reduction, and produced recognition performance comparable to a more complex pipelined system. Consistent performance improvements of 4.89% absolute in WER reduction over the baseline AVSR system using feature fusion are also obtained.

Submitted 6 January, 2020; originally announced January 2020.

Comments: 5 pages, 5 figures, submitted to icassp2019

arXiv:1911.05126 [pdf, ps, other]

KPsec: Secure End-to-End Communications for Multi-Hop Wireless Networks

Authors: Mohammed Gharib, Ali Owfi, Soudeh Ghorbani

Abstract: The security of cyber-physical systems, from self-driving cars to medical devices, depends on their underlying multi-hop wireless networks. Yet, the lack of trusted central infrastructures and limited nodes' resources make securing these networks challenging. Recent works on key pre-distribution schemes, where nodes communicate over encrypted overlay paths, provide an appealing solution because of their distributed, computationally light-weight nature. Alas, these schemes share a glaring security vulnerability: the two ends of every overlay link can decrypt, and potentially modify and alter, the message. Plus, the longer overlay paths impose traffic overhead and increase latency. We present a novel routing mechanism, KPsec, to address these issues. KPsec deploys multiple disjoint paths and an initial key-exchange phase to secure end-to-end communications. After the initial key-exchange phase, traffic in KPsec follows the shortest paths and, in contrast to key pre-distribution schemes, intermediate nodes cannot decrypt it. We measure the security and performance of KPsec as well as three state-of-the-art key pre-distribution schemes using a real 10-node testbed and large-scale simulations. Our experiments show that, in addition to its security benefits, KPsec results in 5-15% improvement in network throughput, up to 75% reduction in latency, and an order of magnitude reduction in energy consumption.

Submitted 12 November, 2019; originally announced November 2019.

Comments: 20 pages, 10 figures, 3 tables, testbed experiment, exhaustive performance evaluation

arXiv:1910.00565 [pdf, ps, other]

Domain Expansion in DNN-based Acoustic Models for Robust Speech Recognition

Authors: Shahram Ghorbani, Soheil Khorram, John H. L. Hansen

Abstract: Training acoustic models with sequentially incoming data, while both leveraging new data and avoiding the forgetting effect, is an essential obstacle to achieving human intelligence level in speech recognition. An obvious approach to leverage data from a new domain (e.g., new accented speech) is to first generate a comprehensive dataset of all domains, by combining all available data, and then use this dataset to retrain the acoustic models. However, as the amount of training data grows, storing and retraining on such a large-scale dataset becomes practically impossible. To deal with this problem, in this study, we investigate several domain expansion techniques which exploit only the data of the new domain to build a stronger model for all domains. These techniques are aimed at learning the new domain with a minimal forgetting effect (i.e., they maintain original model performance). These techniques modify the adaptation procedure by imposing new constraints including (1) weight constraint adaptation (WCA): keeping the model parameters close to the original model parameters; (2) elastic weight consolidation (EWC): slowing down training for parameters that are important for previously established domains; (3) soft KL-divergence (SKLD): restricting the KL-divergence between the original and the adapted model output distributions; and (4) hybrid SKLD-EWC: incorporating both SKLD and EWC constraints. We evaluate these techniques in an accent adaptation task in which we adapt a deep neural network (DNN) acoustic model trained with native English to three different English accents: Australian, Hispanic, and Indian. The experimental results show that SKLD significantly outperforms EWC, and EWC works better than WCA. The hybrid SKLD-EWC technique results in the best overall performance.

Submitted 1 October, 2019; originally announced October 2019.

Comments: Accepted at ASRU, 2019

arXiv:1908.08919 [pdf, other]

doi 10.1109/ICASSP39728.2021.9413516

In-bed Pressure-based Pose Estimation using Image Space Representation Learning

Authors: Vandad Davoodnia, Saeed Ghorbani, Ali Etemad

Abstract: Recent advances in deep pose estimation models have proven to be effective in a wide range of applications such as health monitoring, sports, animations, and robotics. However, pose estimation models fail to generalize when facing images acquired from in-bed pressure sensing systems. In this paper, we address this challenge by presenting a novel end-to-end framework capable of accurately locating body parts from vague pressure data. Our method exploits the idea of equipping an off-the-shelf pose estimator with a deep trainable neural network, which pre-processes and prepares the pressure data for subsequent pose estimation. Our model transforms the ambiguous pressure maps to images containing shapes and structures similar to the common input domain of the pre-existing pose estimation methods. As a result, we show that our model is able to reconstruct unclear body parts, which in turn enables pose estimators to accurately and robustly estimate the pose. We train and test our method on a manually annotated public pressure map dataset using a combination of loss functions. Results confirm the effectiveness of our method by the high visual quality in the generated images and the high pose estimation rates achieved.

Submitted 18 May, 2021; v1 submitted 20 August, 2019; originally announced August 2019.

Comments: © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Journal ref: 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 3965-3969). IEEE

arXiv:1907.13580 [pdf, other]

doi 10.1007/978-3-030-22514-8_14

Auto-labelling of Markers in Optical Motion Capture by Permutation Learning

Authors: Saeed Ghorbani, Ali Etemad, Nikolaus F. Troje

Abstract: Optical marker-based motion capture is a vital tool in applications such as motion and behavioural analysis, animation, and biomechanics. Labelling, that is, assigning optical markers to the pre-defined positions on the body is a time consuming and labour intensive postprocessing part of current motion capture pipelines. The problem can be considered as a ranking process in which markers shuffled by an unknown permutation matrix are sorted to recover the correct order. In this paper, we present a framework for automatic marker labelling which first estimates a permutation matrix for each individual frame using a differentiable permutation learning model and then utilizes temporal consistency to identify and correct remaining labelling errors. Experiments conducted on the test data show the effectiveness of our framework.

Submitted 31 July, 2019; originally announced July 2019.

Journal ref: Computer Graphics International Conference, pp. 167-178. Springer, Cham, 2019

Showing 1–14 of 14 results for author: Ghorbani, S