Search | arXiv e-print repository

Multiple Toddler Tracking in Indoor Videos

Authors: Somaieh Amraee, Bishoy Galoaa, Matthew Goodwin, Elaheh Hatamimajoumerd, Sarah Ostadabbas

Abstract: Multiple toddler tracking (MTT) involves identifying and differentiating toddlers in video footage. While conventional multi-object tracking (MOT) algorithms are adept at tracking diverse objects, toddlers pose unique challenges due to their unpredictable movements, various poses, and similar appearance. Tracking toddlers in indoor environments introduces additional complexities such as occlusions and limited fields of view. In this paper, we address the challenges of MTT and propose MTTSort, a customized method built upon the DeepSort algorithm. MTTSort is designed to track multiple toddlers in indoor videos accurately. Our contributions include discussing the primary challenges in MTT, introducing a genetic algorithm to optimize hyperparameters, proposing an accurate tracking algorithm, and curating the MTTrack dataset using unbiased AI co-labeling techniques. We quantitatively compare MTTSort to state-of-the-art MOT methods on MTTrack, DanceTrack, and MOT15 datasets. In our evaluation, the proposed method outperformed other MOT methods, achieving 0.98, 0.68, and 0.98 in multiple object tracking accuracy (MOTA), higher order tracking accuracy (HOTA), and iterative and discriminative framework 1 (IDF1) metrics, respectively.

Submitted 29 November, 2023; originally announced November 2023.

arXiv:2311.12300 [pdf, other]

Challenges in Video-Based Infant Action Recognition: A Critical Examination of the State of the Art

Authors: Elaheh Hatamimajoumerd, Pooria Daneshvar Kakhaki, Xiaofei Huang, Lingfei Luan, Somaieh Amraee, Sarah Ostadabbas

Abstract: Automated human action recognition, a burgeoning field within computer vision, boasts diverse applications spanning surveillance, security, human-computer interaction, tele-health, and sports analysis. Precise action recognition in infants serves a multitude of pivotal purposes, encompassing safety monitoring, developmental milestone tracking, early intervention for developmental delays, fostering parent-infant bonds, advancing computer-aided diagnostics, and contributing to the scientific comprehension of child development. This paper delves into the intricacies of infant action recognition, a domain that has remained relatively uncharted despite the accomplishments in adult action recognition. In this study, we introduce a groundbreaking dataset called ``InfActPrimitive'', encompassing five significant infant milestone action categories, and we incorporate specialized preprocessing for infant data. We conducted an extensive comparative analysis employing cutting-edge skeleton-based action recognition models using this dataset. Our findings reveal that, although the PoseC3D model achieves the highest accuracy at approximately 71%, the remaining models struggle to accurately capture the dynamics of infant actions. This highlights a substantial knowledge gap between infant and adult action recognition domains and the urgent need for data-efficient pipeline models.

Submitted 20 November, 2023; originally announced November 2023.

arXiv:2310.16138 [pdf, other]

Subtle Signals: Video-based Detection of Infant Non-nutritive Sucking as a Neurodevelopmental Cue

Authors: Shaotong Zhu, Michael Wan, Sai Kumar Reddy Manne, Emily Zimmerman, Sarah Ostadabbas

Abstract: Non-nutritive sucking (NNS), which refers to the act of sucking on a pacifier, finger, or similar object without nutrient intake, plays a crucial role in assessing healthy early development. In the case of preterm infants, NNS behavior is a key component in determining their readiness for feeding. In older infants, the characteristics of NNS behavior offer valuable insights into neural and motor development. Additionally, NNS activity has been proposed as a potential safeguard against sudden infant death syndrome (SIDS). However, the clinical application of NNS assessment is currently hindered by labor-intensive and subjective finger-in-mouth evaluations. Consequently, researchers often resort to expensive pressure transducers for objective NNS signal measurement. To enhance the accessibility and reliability of NNS signal monitoring for both clinicians and researchers, we introduce a vision-based algorithm designed for non-contact detection of NNS activity using baby monitor footage in natural settings. Our approach involves a comprehensive exploration of optical flow and temporal convolutional networks, enabling the detection and amplification of subtle infant-sucking signals. We successfully classify short video clips of uniform length into NNS and non-NNS periods. Furthermore, we investigate manual and learning-based techniques to piece together local classification results, facilitating the segmentation of longer mixed-activity videos into NNS and non-NNS segments of varying duration. Our research introduces two novel datasets of annotated infant videos, including one sourced from our clinical study featuring 19 infant subjects and 183 hours of overnight baby monitor footage.

Submitted 24 October, 2023; originally announced October 2023.

arXiv:2307.13110 [pdf, other]

Automatic Infant Respiration Estimation from Video: A Deep Flow-based Algorithm and a Novel Public Benchmark

Authors: Sai Kumar Reddy Manne, Shaotong Zhu, Sarah Ostadabbas, Michael Wan

Abstract: Respiration is a critical vital sign for infants, and continuous respiratory monitoring is particularly important for newborns. However, neonates are sensitive and contact-based sensors present challenges in comfort, hygiene, and skin health, especially for preterm babies. As a step toward fully automatic, continuous, and contactless respiratory monitoring, we develop a deep-learning method for estimating respiratory rate and waveform from plain video footage in natural settings. Our automated infant respiration flow-based network (AIRFlowNet) combines video-extracted optical flow input and spatiotemporal convolutional processing tuned to the infant domain. We support our model with the first public annotated infant respiration dataset with 125 videos (AIR-125), drawn from eight infant subjects, set varied pose, lighting, and camera conditions. We include manual respiration annotations and optimize AIRFlowNet training on them using a novel spectral bandpass loss function. When trained and tested on the AIR-125 infant data, our method significantly outperforms other state-of-the-art methods in respiratory rate estimation, achieving a mean absolute error of $\sim$2.9 breaths per minute, compared to $\sim$4.7--6.2 for other public models designed for adult subjects and more uniform environments.

Submitted 24 July, 2023; originally announced July 2023.

arXiv:2306.02631 [pdf, other]

Bridging the Domain Gap between Synthetic and Real-World Data for Autonomous Driving

Authors: Xiangyu Bai, Yedi Luo, Le Jiang, Aniket Gupta, Pushyami Kaveti, Hanumant Singh, Sarah Ostadabbas

Abstract: Modern autonomous systems require extensive testing to ensure reliability and build trust in ground vehicles. However, testing these systems in the real-world is challenging due to the lack of large and diverse datasets, especially in edge cases. Therefore, simulations are necessary for their development and evaluation. However, existing open-source simulators often exhibit a significant gap between synthetic and real-world domains, leading to deteriorated mobility performance and reduced platform reliability when using simulation data. To address this issue, our Scoping Autonomous Vehicle Simulation (SAVeS) platform benchmarks the performance of simulated environments for autonomous ground vehicle testing between synthetic and real-world domains. Our platform aims to quantify the domain gap and enable researchers to develop and test autonomous systems in a controlled environment. Additionally, we propose using domain adaptation technologies to address the domain gap between synthetic and real-world data with our SAVeS$^+$ extension. Our results demonstrate that SAVeS$^+$ is effective in helping to close the gap between synthetic and real-world domains and yields comparable performance for models trained with processed synthetic datasets to those trained on real-world datasets of same scale. This paper highlights our efforts to quantify and address the domain gap between synthetic and real-world data for autonomy simulation. By enabling researchers to develop and test autonomous systems in a controlled environment, we hope to bring autonomy simulation one step closer to realization.

Submitted 5 June, 2023; originally announced June 2023.

arXiv:2306.01704 [pdf, other]

Temporal-controlled Frame Swap for Generating High-Fidelity Stereo Driving Data for Autonomy Analysis

Authors: Yedi Luo, Xiangyu Bai, Le Jiang, Aniket Gupta, Eric Mortin, Hanumant Singh, Sarah Ostadabbas

Abstract: This paper presents a novel approach, TeFS (Temporal-controlled Frame Swap), to generate synthetic stereo driving data for visual simultaneous localization and mapping (vSLAM) tasks. TeFS is designed to overcome the lack of native stereo vision support in commercial driving simulators, and we demonstrate its effectiveness using Grand Theft Auto V (GTA V), a high-budget open-world video game engine. We introduce GTAV-TeFS, the first large-scale GTA V stereo-driving dataset, containing over 88,000 high-resolution stereo RGB image pairs, along with temporal information, GPS coordinates, camera poses, and full-resolution dense depth maps. GTAV-TeFS offers several advantages over other synthetic stereo datasets and enables the evaluation and enhancement of state-of-the-art stereo vSLAM models under GTA V's environment. We validate the quality of the stereo data collected using TeFS by conducting a comparative analysis with the conventional dual-viewport data using an open-source simulator. We also benchmark various vSLAM models using the challenging-case comparison groups included in GTAV-TeFS, revealing the distinct advantages and limitations inherent to each model. The goal of our work is to bring more high-fidelity stereo data from commercial-grade game simulators into the research domain and push the boundary of vSLAM models.

Submitted 25 December, 2023; v1 submitted 2 June, 2023; originally announced June 2023.

arXiv:2305.17845 [pdf, other]

SPAC-Net: Synthetic Pose-aware Animal ControlNet for Enhanced Pose Estimation

Authors: Le Jiang, Sarah Ostadabbas

Abstract: Animal pose estimation has become a crucial area of research, but the scarcity of annotated data is a significant challenge in developing accurate models. Synthetic data has emerged as a promising alternative, but it frequently exhibits domain discrepancies with real data. Style transfer algorithms have been proposed to address this issue, but they suffer from insufficient spatial correspondence, leading to the loss of label information. In this work, we present a new approach called Synthetic Pose-aware Animal ControlNet (SPAC-Net), which incorporates ControlNet into the previously proposed Prior-Aware Synthetic animal data generation (PASyn) pipeline. We leverage the plausible pose data generated by the Variational Auto-Encoder (VAE)-based data generation pipeline as input for the ControlNet Holistically-nested Edge Detection (HED) boundary task model to generate synthetic data with pose labels that are closer to real data, making it possible to train a high-precision pose estimation network without the need for real data. In addition, we propose the Bi-ControlNet structure to separately detect the HED boundary of animals and backgrounds, improving the precision and stability of the generated data. Using the SPAC-Net pipeline, we generate synthetic zebra and rhino images and test them on the AP10K real dataset, demonstrating superior performance compared to using only real images or synthetic data generated by other methods. Our work demonstrates the potential for synthetic data to overcome the challenge of limited annotated data in animal pose estimation.

Submitted 31 May, 2023; v1 submitted 28 May, 2023; originally announced May 2023.

Comments: arXiv admin note: text overlap with arXiv:2208.13944

arXiv:2303.16867 [pdf, other]

A Video-based End-to-end Pipeline for Non-nutritive Sucking Action Recognition and Segmentation in Young Infants

Authors: Shaotong Zhu, Michael Wan, Elaheh Hatamimajoumerd, Kashish Jain, Samuel Zlota, Cholpady Vikram Kamath, Cassandra B. Rowan, Emma C. Grace, Matthew S. Goodwin, Marie J. Hayes, Rebecca A. Schwartz-Mette, Emily Zimmerman, Sarah Ostadabbas

Abstract: We present an end-to-end computer vision pipeline to detect non-nutritive sucking (NNS) -- an infant sucking pattern with no nutrition delivered -- as a potential biomarker for developmental delays, using off-the-shelf baby monitor video footage. One barrier to clinical (or algorithmic) assessment of NNS stems from its sparsity, requiring experts to wade through hours of footage to find minutes of relevant activity. Our NNS activity segmentation algorithm solves this problem by identifying periods of NNS with high certainty -- up to 94.0\% average precision and 84.9\% average recall across 30 heterogeneous 60 s clips, drawn from our manually annotated NNS clinical in-crib dataset of 183 hours of overnight baby monitor footage from 19 infants. Our method is based on an underlying NNS action recognition algorithm, which uses spatiotemporal deep learning networks and infant-specific pose estimation, achieving 94.9\% accuracy in binary classification of 960 2.5 s balanced NNS vs. non-NNS clips. Tested on our second, independent, and public NNS in-the-wild dataset, NNS recognition classification reaches 92.3\% accuracy, and NNS segmentation achieves 90.8\% precision and 84.2\% recall.

Submitted 29 March, 2023; originally announced March 2023.

arXiv:2210.15022 [pdf, other]

Automatic Assessment of Infant Face and Upper-Body Symmetry as Early Signs of Torticollis

Authors: Michael Wan, Xiaofei Huang, Bethany Tunik, Sarah Ostadabbas

Abstract: We apply computer vision pose estimation techniques developed expressly for the data-scarce infant domain to the study of torticollis, a common condition in infants for which early identification and treatment is critical. Specifically, we use a combination of facial landmark and body joint estimation techniques designed for infants to estimate a range of geometric measures pertaining to face and upper body symmetry, drawn from an array of sources in the physical therapy and ophthalmology research literature in torticollis. We gauge performance with a range of metrics and show that the estimates of most of these geometric measures are successful, yielding strong to very strong Spearman's $ρ$ correlation with ground truth values. Furthermore, we show that these estimates, derived from pose estimation neural networks designed for the infant domain, cleanly outperform estimates derived from more widely known networks designed for the adult domain.

Submitted 7 November, 2022; v1 submitted 26 October, 2022; originally announced October 2022.

arXiv:2208.13944 [pdf, other]

Prior-Aware Synthetic Data to the Rescue: Animal Pose Estimation with Very Limited Real Data

Authors: Le Jiang, Shuangjun Liu, Xiangyu Bai, Sarah Ostadabbas

Abstract: Accurately annotated image datasets are essential components for studying animal behaviors from their poses. Compared to the number of species we know and may exist, the existing labeled pose datasets cover only a small portion of them, while building comprehensive large-scale datasets is prohibitively expensive. Here, we present a very data efficient strategy targeted for pose estimation in quadrupeds that requires only a small amount of real images from the target animal. It is confirmed that fine-tuning a backbone network with pretrained weights on generic image datasets such as ImageNet can mitigate the high demand for target animal pose data and shorten the training time by learning the prior knowledge of object segmentation and keypoint estimation in advance. However, when faced with serious data scarcity (i.e., $<10^2$ real images), the model performance stays unsatisfactory, particularly for limbs with considerable flexibility and several comparable parts. We therefore introduce a prior-aware synthetic animal data generation pipeline called PASyn to augment the animal pose data essential for robust pose estimation. PASyn generates a probabilistically-valid synthetic pose dataset, SynAP, through training a variational generative model on several animated 3D animal models. In addition, a style transfer strategy is utilized to blend the synthetic animal image into the real backgrounds. We evaluate the improvement made by our approach with three popular backbone networks and test their pose estimation accuracy on publicly available animal pose images as well as collected from real animals in a zoo.

Submitted 29 August, 2022; originally announced August 2022.

arXiv:2207.12537 [pdf, other]

Live Stream Temporally Embedded 3D Human Body Pose and Shape Estimation

Authors: Zhouping Wang, Sarah Ostadabbas

Abstract: 3D Human body pose and shape estimation within a temporal sequence can be quite critical for understanding human behavior. Despite the significant progress in human pose estimation in the recent years, which are often based on single images or videos, human motion estimation on live stream videos is still a rarely-touched area considering its special requirements for real-time output and temporal consistency. To address this problem, we present a temporally embedded 3D human body pose and shape estimation (TePose) method to improve the accuracy and temporal consistency of pose estimation in live stream videos. TePose uses previous predictions as a bridge to feed back the error for better estimation in the current frame and to learn the correspondence between data frames and predictions in the history. A multi-scale spatio-temporal graph convolutional network is presented as the motion discriminator for adversarial training using datasets without any 3D labeling. We propose a sequential data loading strategy to meet the special start-to-end data processing requirement of live stream. We demonstrate the importance of each proposed module with extensive experiments. The results show the effectiveness of TePose on widely-used human pose benchmarks with state-of-the-art performance.

Submitted 25 July, 2022; originally announced July 2022.

arXiv:2207.09352 [pdf, other]

Computer Vision to the Rescue: Infant Postural Symmetry Estimation from Incongruent Annotations

Authors: Xiaofei Huang, Michael Wan, Lingfei Luan, Bethany Tunik, Sarah Ostadabbas

Abstract: Bilateral postural symmetry plays a key role as a potential risk marker for autism spectrum disorder (ASD) and as a symptom of congenital muscular torticollis (CMT) in infants, but current methods of assessing symmetry require laborious clinical expert assessments. In this paper, we develop a computer vision based infant symmetry assessment system, leveraging 3D human pose estimation for infants. Evaluation and calibration of our system against ground truth assessments is complicated by our findings from a survey of human ratings of angle and symmetry, that such ratings exhibit low inter-rater reliability. To rectify this, we develop a Bayesian estimator of the ground truth derived from a probabilistic graphical model of fallible human raters. We show that the 3D infant pose estimation model can achieve 68% area under the receiver operating characteristic curve performance in predicting the Bayesian aggregate labels, compared to only 61% from a 2D infant pose estimation model and 60% from a 3D adult pose estimation model, highlighting the importance of 3D poses and infant domain knowledge in assessing infant body symmetry. Our survey analysis also suggests that human ratings are susceptible to higher levels of bias and inconsistency, and hence our final 3D pose-based symmetry assessment system is calibrated but not directly supervised by Bayesian aggregate human ratings, yielding higher levels of consistency and lower levels of inter-limb assessment bias.

Submitted 19 July, 2022; originally announced July 2022.

arXiv:2201.11828 [pdf, other]

Pressure Eye: In-bed Contact Pressure Estimation via Contact-less Imaging

Authors: Shuangjun Liu, Sarah Ostadabbas

Abstract: Computer vision has achieved great success in interpreting semantic meanings from images, yet estimating underlying (non-visual) physical properties of an object is often limited to their bulk values rather than reconstructing a dense map. In this work, we present our pressure eye (PEye) approach to estimate contact pressure between a human body and the surface she is lying on with high resolution from vision signals directly. PEye approach could ultimately enable the prediction and early detection of pressure ulcers in bed-bound patients, that currently depends on the use of expensive pressure mats. Our PEye network is configured in a dual encoding shared decoding form to fuse visual cues and some relevant physical parameters in order to reconstruct high resolution pressure maps (PMs). We also present a pixel-wise resampling approach based on Naive Bayes assumption to further enhance the PM regression performance. A percentage of correct sensing (PCS) tailored for sensing estimation accuracy evaluation is also proposed which provides another perspective for performance evaluation under varying error tolerances. We tested our approach via a series of extensive experiments using multimodal sensing technologies to collect data from 102 subjects while lying on a bed. The individual's high resolution contact pressure data could be estimated from their RGB or long wavelength infrared (LWIR) images with 91.8% and 91.2% estimation accuracies in $PCS_{efs0.1}$ criteria, superior to state-of-the-art methods in the related image regression/translation tasks.

Submitted 27 January, 2022; originally announced January 2022.

arXiv:2110.08935 [pdf, other]

InfAnFace: Bridging the infant-adult domain gap in facial landmark estimation in the wild

Authors: Michael Wan, Shaotong Zhu, Lingfei Luan, Gulati Prateek, Xiaofei Huang, Rebecca Schwartz-Mette, Marie Hayes, Emily Zimmerman, Sarah Ostadabbas

Abstract: We lay the groundwork for research in the algorithmic comprehension of infant faces, in anticipation of applications from healthcare to psychology, especially in the early prediction of developmental disorders. Specifically, we introduce the first-ever dataset of infant faces annotated with facial landmark coordinates and pose attributes, demonstrate the inadequacies of existing facial landmark estimation algorithms in the infant domain, and train new state-of-the-art models that significantly improve upon those algorithms using domain adaptation techniques. We touch on the closely related task of facial detection for infants, and also on a challenging case study of infrared baby monitor images gathered by our lab as part of in-field research into the aforementioned developmental issues.

Submitted 26 May, 2022; v1 submitted 17 October, 2021; originally announced October 2021.

arXiv:2110.06877 [pdf, other]

A Review on Human Pose Estimation

Authors: Rohit Josyula, Sarah Ostadabbas

Abstract: The phenomenon of Human Pose Estimation (HPE) is a problem that has been explored over the years, particularly in computer vision. But what exactly is it? To answer this, the concept of a pose must first be understood. Pose can be defined as the arrangement of human joints in a specific manner. Therefore, we can define the problem of Human Pose Estimation as the localization of human joints or predefined landmarks in images and videos. There are several types of pose estimation, including body, face, and hand, as well as many aspects to it. This paper will cover them, starting with the classical approaches to HPE to the Deep Learning based models.

Submitted 13 October, 2021; originally announced October 2021.

arXiv:2108.10360 [pdf, other]

doi 10.1007/s11263-022-01603-x

Interpreting Face Inference Models using Hierarchical Network Dissection

Authors: Divyang Teotia, Agata Lapedriza, Sarah Ostadabbas

Abstract: This paper presents Hierarchical Network Dissection, a general pipeline to interpret the internal representation of face-centric inference models. Using a probabilistic formulation, our pipeline pairs units of the model with concepts in our "Face Dictionary", a collection of facial concepts with corresponding sample images. Our pipeline is inspired by Network Dissection, a popular interpretability model for object-centric and scene-centric models. However, our formulation allows us to deal with two important challenges of face-centric models that Network Dissection cannot address: (1) spatial overlap of concepts: there are different facial concepts that simultaneously occur in the same region of the image, like "nose" (facial part) and "pointy nose" (facial attribute); and (2) global concepts: there are units with affinity to concepts that do not refer to specific locations of the face (e.g. apparent age). We use Hierarchical Network Dissection to dissect different face-centric inference models trained on widely-used facial datasets. The results show models trained for different tasks learned different internal representations. Furthermore, the interpretability results can reveal some biases in the training data and some interesting characteristics of the face-centric inference tasks. Finally, we conduct controlled experiments on biased data to showcase the potential of Hierarchical Network Dissection for bias discovery. The results illustrate how Hierarchical Network Dissection can be used to discover and quantify bias in the training data that is also encoded in the model.

Submitted 28 March, 2022; v1 submitted 23 August, 2021; originally announced August 2021.

Journal ref: International Journal of Computer Vision (2022)

arXiv:2106.10393 [pdf, other]

Dynamical Deep Generative Latent Modeling of 3D Skeletal Motion

Authors: Amirreza Farnoosh, Sarah Ostadabbas

Abstract: In this paper, we propose a Bayesian switching dynamical model for segmentation of 3D pose data over time that uncovers interpretable patterns in the data and is generative. Our model decomposes highly correlated skeleton data into a small set of spatial bases of switching temporal processes in a low-dimensional latent framework. We parameterize these temporal processes with regard to a switching deep vector autoregressive prior in order to accommodate both multimodal and higher-order nonlinear inter-dependencies. This results in a dynamical deep generative latent model that parses the meaningful intrinsic states in the dynamics of 3D pose data using approximate variational inference, and enables a realistic low-level dynamical generation and segmentation of complex skeleton movements. Our experiments on four biological motion datasets, covering bat flight, salsa dance, walking, and golf, substantiate superior performance of our model in comparison with the state-of-the-art methods.

Submitted 18 June, 2021; originally announced June 2021.

arXiv:2105.10996 [pdf, other]

Heuristic Weakly Supervised 3D Human Pose Estimation

Authors: Shuangjun Liu, Michael Wan, Sarah Ostadabbas

Abstract: Monocular 3D human pose estimation from RGB images has attracted significant attention in recent years. However, recent models depend on supervised training with 3D pose ground truth data or known pose priors for their target domains. 3D pose data is typically collected with motion capture devices, severely limiting their applicability. In this paper, we present a heuristic weakly supervised 3D human pose (HW-HuP) solution to estimate 3D poses when no ground truth 3D pose data is available. HW-HuP learns partial pose priors from 3D human pose datasets and uses easy-to-access observations from the target domain to estimate 3D human pose and shape in an optimization and regression cycle. We employ depth data for weak supervision during training, but not inference. We show that HW-HuP meaningfully improves upon state-of-the-art models in two practical settings where 3D pose data can hardly be obtained: human poses in bed, and infant poses in the wild. Furthermore, we show that HW-HuP retains comparable performance to cutting-edge models on public benchmarks, even when such models train on 3D pose data.

Submitted 12 May, 2023; v1 submitted 23 May, 2021; originally announced May 2021.

arXiv:2105.10837 [pdf, other]

Adapted Human Pose: Monocular 3D Human Pose Estimation with Zero Real 3D Pose Data

Authors: Shuangjun Liu, Naveen Sehgal, Sarah Ostadabbas

Abstract: The ultimate goal for an inference model is to be robust and functional in real life applications. However, training vs. test data domain gaps often negatively affect model performance. This issue is especially critical for the monocular 3D human pose estimation problem, in which 3D human data is often collected in a controlled lab setting. In this paper, we focus on alleviating the negative effect of domain shift in both appearance and pose space for 3D human pose estimation by presenting our adapted human pose (AHuP) approach. AHuP is built upon two key components: (1) semantically aware adaptation (SAA) for the cross-domain feature space adaptation, and (2) skeletal pose adaptation (SPA) for the pose space adaptation which takes only limited information from the target domain. By using zero real 3D human pose data, one of our adapted synthetic models shows comparable performance with the SOTA pose estimation models trained with large scale real 3D human datasets. The proposed SPA can also be employed independently as a lightweight head to improve existing SOTA models in a novel context. A new 3D scan-based synthetic human dataset called ScanAva+ is also going to be publicly released with this work.

Submitted 22 January, 2022; v1 submitted 22 May, 2021; originally announced May 2021.

arXiv:2010.06100 [pdf, other]

Invariant Representation Learning for Infant Pose Estimation with Small Data

Authors: Xiaofei Huang, Nihang Fu, Shuangjun Liu, Sarah Ostadabbas

Abstract: Infant motion analysis is a topic with critical importance in early childhood development studies. However, while the applications of human pose estimation have become more and more broad, models trained on large-scale adult pose datasets are barely successful in estimating infant poses due to the significant differences in their body ratio and the versatility of their poses. Moreover, the privacy and security considerations hinder the availability of adequate infant pose data required for training of a robust model from scratch. To address this problem, this paper presents (1) building and publicly releasing a hybrid synthetic and real infant pose (SyRIP) dataset with small yet diverse real infant images as well as generated synthetic infant poses and (2) a multi-stage invariant representation learning strategy that could transfer the knowledge from the adjacent domains of adult poses and synthetic infant images into our fine-tuned domain-adapted infant pose (FiDIP) estimation model. In our ablation study, with identical network structure, models trained on SyRIP dataset show noticeable improvement over the ones trained on the only other public infant pose datasets. Integrated with pose estimation backbone networks with varying complexity, FiDIP performs consistently better than the fine-tuned versions of those models. One of our best infant pose estimation performers on the state-of-the-art DarkPose model shows mean average precision (mAP) of 93.6.

Submitted 1 November, 2021; v1 submitted 12 October, 2020; originally announced October 2020.

arXiv:2009.05135 [pdf, other]

Deep Switching Auto-Regressive Factorization: Application to Time Series Forecasting

Authors: Amirreza Farnoosh, Bahar Azari, Sarah Ostadabbas

Abstract: We introduce deep switching auto-regressive factorization (DSARF), a deep generative model for spatio-temporal data with the capability to unravel recurring patterns in the data and perform robust short- and long-term predictions. Similar to other factor analysis methods, DSARF approximates high dimensional data by a product between time dependent weights and spatially dependent factors. These weights and factors are in turn represented in terms of lower dimensional latent variables that are inferred using stochastic variational inference. DSARF is different from the state-of-the-art techniques in that it parameterizes the weights in terms of a deep switching vector auto-regressive likelihood governed with a Markovian prior, which is able to capture the non-linear inter-dependencies among weights to characterize multimodal temporal dynamics. This results in a flexible hierarchical deep generative factor analysis model that can be extended to (i) provide a collection of potentially interpretable states abstracted from the process dynamics, and (ii) perform short- and long-term vector time series prediction in a complex multi-relational setting. Our extensive experiments, which include simulated data and real data from a wide range of applications such as climate change, weather forecasting, traffic, infectious disease spread and nonlinear physical systems, attest to the superior performance of DSARF in terms of long- and short-term prediction error, when compared with the state-of-the-art methods.

Submitted 10 September, 2020; originally announced September 2020.

arXiv:2008.08735 [pdf, other]

Simultaneously-Collected Multimodal Lying Pose Dataset: Towards In-Bed Human Pose Monitoring under Adverse Vision Conditions

Authors: Shuangjun Liu, Xiaofei Huang, Nihang Fu, Cheng Li, Zhongnan Su, Sarah Ostadabbas

Abstract: Computer vision (CV) has achieved great success in interpreting semantic meanings from images, yet CV algorithms can be brittle for tasks with adverse vision conditions and the ones suffering from data/label pair limitation. One of these tasks is in-bed human pose estimation, which has significant value in many healthcare applications. In-bed pose monitoring in natural settings could involve complete darkness or full occlusion. Furthermore, the lack of publicly available in-bed pose datasets hinders the use of many successful pose estimation algorithms for this task. In this paper, we introduce our Simultaneously-collected multimodal Lying Pose (SLP) dataset, which includes in-bed pose images from 109 participants captured using multiple imaging modalities including RGB, long wave infrared, depth, and pressure map. We also present a physical hyper parameter tuning strategy for ground truth pose label generation under extreme conditions such as lights off and being fully covered by a sheet/blanket. SLP design is compatible with the mainstream human pose datasets; therefore, the state-of-the-art 2D pose estimation models can be trained effectively with SLP data with promising performance as high as 95% at PCK@0.2 on a single modality. The pose estimation performance can be further improved by including additional modalities through collaboration.

Submitted 19 August, 2020; originally announced August 2020.

arXiv:2003.09779 [pdf, other]

Deep Markov Spatio-Temporal Factorization

Authors: Amirreza Farnoosh, Behnaz Rezaei, Eli Zachary Sennesh, Zulqarnain Khan, Jennifer Dy, Ajay Satpute, J Benjamin Hutchinson, Jan-Willem van de Meent, Sarah Ostadabbas

Abstract: We introduce deep Markov spatio-temporal factorization (DMSTF), a generative model for dynamical analysis of spatio-temporal data. Like other factor analysis methods, DMSTF approximates high dimensional data by a product between time dependent weights and spatially dependent factors. These weights and factors are in turn represented in terms of lower dimensional latents inferred using stochastic variational inference. The innovation in DMSTF is that we parameterize weights in terms of a deep Markovian prior extendable with a discrete latent, which is able to characterize nonlinear multimodal temporal dynamics, and perform multidimensional time series forecasting. DMSTF learns a low dimensional spatial latent to generatively parameterize spatial factors or their functional forms in order to accommodate high spatial dimensionality. We parameterize the corresponding variational distribution using a bidirectional recurrent network in the low-level latent representations. This results in a flexible family of hierarchical deep generative factor analysis models that can be extended to perform time series clustering or perform factor analysis in the presence of a control signal. Our experiments, which include simulated and real-world data, demonstrate that DMSTF outperforms related methodologies in terms of predictive performance for unseen data, reveals meaningful clusters in the data, and performs forecasting in a variety of domains with potentially nonlinear temporal transitions.

Submitted 18 August, 2020; v1 submitted 21 March, 2020; originally announced March 2020.

arXiv:2003.07335 [pdf, other]

G-LBM: Generative Low-dimensional Background Model Estimation from Video Sequences

Authors: Behnaz Rezaei, Amirreza Farnoosh, Sarah Ostadabbas

Abstract: In this paper, we propose a computationally tractable and theoretically supported non-linear low-dimensional generative model to represent real-world data in the presence of noise and sparse outliers. The non-linear low-dimensional manifold discovery of data is done through describing a joint distribution over observations and their low-dimensional representations (i.e. manifold coordinates). Our model, called generative low-dimensional background model (G-LBM), admits variational operations on the distribution of the manifold coordinates and simultaneously generates a low-rank structure of the latent manifold given the data. Therefore, our probabilistic model contains the intuition of the non-probabilistic low-dimensional manifold learning. G-LBM selects the intrinsic dimensionality of the underlying manifold of the observations, and its probabilistic nature models the noise in the observation data. G-LBM has direct application in the background scenes model estimation from video sequences and we have evaluated its performance on SBMnet-2016 and BMC2012 datasets, where it achieved a performance higher or comparable to other state-of-the-art methods while being agnostic to the background scenes in videos. Besides, in challenges such as camera jitter and background motion, G-LBM is able to robustly estimate the background by effectively modeling the uncertainties in video observations in these scenarios.

Submitted 17 July, 2020; v1 submitted 16 March, 2020; originally announced March 2020.

arXiv:1912.11751 [pdf, other]

doi 10.1088/2632-2153/ab8967

Development of Use-specific High Performance Cyber-Nanomaterial Optical Detectors by Effective Choice of Machine Learning Algorithms

Authors: Davoud Hejazi, Shuangjun Liu, Amirreza Farnoosh, Sarah Ostadabbas, Swastik Kar

Abstract: Due to their inherent variabilities, nanomaterial-based sensors are challenging to translate into real-world applications, where reliability/reproducibility is key. Recently we showed Bayesian inference can be employed on engineered variability in layered nanomaterial-based optical transmission filters to determine optical wavelengths with high accuracy/precision. In many practical applications the sensing cost/speed and long-term reliability can be equal or more important considerations. Though various machine learning tools are frequently used on sensor/detector networks to address these, their effectiveness on nanomaterial-based sensors has not been explored. Here we show the best choice of ML algorithm in a cyber-nanomaterial detector is mainly determined by specific use considerations, e.g., accuracy, computational cost, speed, and resilience against drifts/ageing effects. When sufficient data/computing resources are provided, highest sensing accuracy can be achieved by the kNN and Bayesian inference algorithms, but these can be computationally expensive for real-time applications. In contrast, artificial neural networks are computationally expensive to train, but provide the fastest result under testing conditions and remain reasonably accurate. When data is limited, SVMs perform well even with small training sets, while other algorithms show considerable reduction in accuracy if data is scarce, hence setting a lower limit on the size of required training data. We show by tracking/modeling the long-term drifts of the detector performance over a large (1 year) period, it is possible to improve the predictive accuracy with no need for recalibration. Our research shows for the first time that if the ML algorithm is chosen specific to the use-case, low-cost solution-processed cyber-nanomaterial detectors can be practically implemented under diverse operational requirements, despite their inherent variabilities.

Submitted 3 January, 2020; v1 submitted 25 December, 2019; originally announced December 2019.

Comments: 34 pages combined with images and references, 5 figures, added 1 table of content graphics image at the beginning of article, fixed the typo in title

arXiv:1909.09566 [pdf, other]

Target-Specific Action Classification for Automated Assessment of Human Motor Behavior from Video

Authors: Behnaz Rezaei, Yiorgos Christakis, Bryan Ho, Kevin Thomas, Kelley Erb, Sarah Ostadabbas, Shyamal Patel

Abstract: Objective monitoring and assessment of human motor behavior can improve the diagnosis and management of several medical conditions. Over the past decade, significant advances have been made in the use of wearable technology for continuously monitoring human motor behavior in free-living conditions. However, wearable technology remains ill-suited for applications which require monitoring and interpretation of complex motor behaviors (e.g. involving interactions with the environment). Recent advances in computer vision and deep learning have opened up new possibilities for extracting information from video recordings. In this paper, we present a hierarchical vision-based behavior phenotyping method for classification of basic human actions in video recordings performed using a single RGB camera. Our method addresses challenges associated with tracking multiple human actors and classification of actions in videos recorded in changing environments with different fields of view. We implement a cascaded pose tracker that uses temporal relationships between detections for short-term tracking and appearance-based tracklet fusion for long-term tracking. Furthermore, for action classification, we use pose evolution maps derived from the cascaded pose tracker as low-dimensional and interpretable representations of the movement sequences for training a convolutional neural network. The cascaded pose tracker achieves an average accuracy of 88% in tracking the target human actor in our video recordings, and the overall system achieves an average test accuracy of 84% for target-specific action classification in untrimmed video recordings.

Submitted 20 September, 2019; originally announced September 2019.

Comments: This manuscript is under submission to the Sensors journal

arXiv:1907.02161 [pdf, other]

Seeing Under the Cover: A Physics Guided Learning Approach for In-Bed Pose Estimation

Authors: Shuangjun Liu, Sarah Ostadabbas

Abstract: Human in-bed pose estimation has great practical value in medical and healthcare applications yet still mainly relies on expensive pressure mapping (PM) solutions. In this paper, we introduce our novel physics inspired vision-based approach that addresses the challenging issues associated with the in-bed pose estimation problem including monitoring a fully covered person in complete darkness. We reformulated this problem using our proposed Under the Cover Imaging via Thermal Diffusion (UCITD) method to capture the high resolution pose information of the body even when it is fully covered by using a long wavelength IR technique. We proposed a physical hyperparameter concept through which we achieved high quality groundtruth pose labels in different modalities. A fully annotated in-bed pose dataset called Simultaneously-collected multimodal Lying Pose (SLP) is also formed/released with the same order of magnitude as most existing large-scale human pose datasets to support complex models' training and evaluation. A network trained from scratch on it and tested on two diverse settings, one in a living room and the other in a hospital room, showed pose estimation performance of 99.5% and 95.7% in PCK0.2 standard, respectively. Moreover, in a multi-factor comparison with a state-of-the-art in-bed pose monitoring solution based on PM, our solution showed significant superiority in all practical aspects by being 60 times cheaper, 300 times smaller, while having higher pose recognition granularity and accuracy.

Submitted 20 September, 2019; v1 submitted 3 July, 2019; originally announced July 2019.

arXiv:1906.01821 [pdf, other]

Infant Contact-less Non-Nutritive Sucking Pattern Quantification via Facial Gesture Analysis

Authors: Xiaofei Huang, Alaina Martens, Emily Zimmerman, Sarah Ostadabbas

Abstract: Non-nutritive sucking (NNS) is defined as the sucking action that occurs when a finger, pacifier, or other object is placed in the baby's mouth, but there is no nutrient delivered. In addition to providing a sense of safety, NNS can even be regarded as an indicator of an infant's central nervous system development. The rich data, such as sucking frequency, the number of cycles, and their amplitude during a baby's non-nutritive sucking, are important clues for judging the brain development of infants or preterm infants. Nowadays most researchers are collecting NNS data by using some contact devices such as pressure transducers. However, such invasive contact will have a direct impact on the baby's natural sucking behavior, resulting in significant distortion in the collected data. Therefore, we propose a novel contact-less NNS data acquisition and quantification scheme, which leverages the facial landmarks tracking technology to extract the movement signals of the baby's jaw from recorded video of the baby's sucking. Since completion of the sucking action requires a large amount of synchronous coordination and neural integration of the facial muscles and the cranial nerves, the facial muscle movement signals accompanying the baby's sucking on a pacifier can indirectly replace the NNS signal. We have evaluated our method on videos collected from several infants during their NNS behaviors and we have achieved the quantified NNS patterns closely comparable to results from visual inspection as well as contact-based sensor readings.

Submitted 5 June, 2019; originally announced June 2019.

arXiv:1902.00820 [pdf, other]

DeepPBM: Deep Probabilistic Background Model Estimation from Video Sequences

Authors: Amirreza Farnoosh, Behnaz Rezaei, Sarah Ostadabbas

Abstract: This paper presents a novel unsupervised probabilistic model estimation of visual background in video sequences using a variational autoencoder framework. Due to the redundant nature of the backgrounds in surveillance videos, visual information of the background can be compressed into a low-dimensional subspace in the encoder part of the variational autoencoder, while the highly variant information of its moving foreground gets filtered throughout its encoding-decoding process. Our deep probabilistic background model (DeepPBM) estimation approach is enabled by the power of deep neural networks in learning compressed representations of video frames and reconstructing them back to the original domain. We evaluated the performance of our DeepPBM in background subtraction on 9 surveillance videos from the background model challenge (BMC2012) dataset, and compared that with a standard subspace learning technique, robust principal component analysis (RPCA), which similarly estimates a deterministic low dimensional representation of the background in videos and is widely used for this application. Our method outperforms RPCA on the BMC2012 dataset by 23% on average in F-measure score, and background subtraction using the trained model can be done more than 10 times faster.

Submitted 2 February, 2019; originally announced February 2019.

arXiv:1901.09452 [pdf, other]

Bayesian Inference-enabled Precise Optical Wavelength Estimation using Transition Metal Dichalcogenide Thin Films

Authors: Davoud Hejazi, Shuangjun Liu, Sarah Ostadabbas, Swastik Kar

Abstract: Despite its ability to draw precise inferences from large and complex datasets, the use of data analytics in the field of condensed matter and materials sciences -- where vast quantities of complex metrology data are regularly generated -- has remained surprisingly limited. Specifically, such approaches could dramatically reduce the engineering complexities of devices that directly exploit the physical properties of materials. Here, we present a cyber-physical system for accurately estimating the wavelength of any monochromatic light in the range of 325-1100nm, by applying Bayesian inference on the optical transmittance data from a few low-cost, easy-to-fabricate thin film "filters" of layered transition metal dichalcogenides (TMDs) such as MoS2 and WS2. Wavelengths of tested monochromatic light could be estimated with only 1% estimation error over 99% of the stated spectral range, with lowest error values reaching as low as a few tens of parts per million (ppm) in a system with only eleven filters. By step-wise elimination of filters with the least contribution toward accuracy, mean estimation accuracy of 99% could be obtained even in a two-filter system. Furthermore, we provide a statistical approach for selecting the best "filter" material for any intended spectral range based on the spectral variation of transmittance within the desired range of wavelengths. And finally, we demonstrate that calibrating the data-driven models for the filters from time to time overcomes the minor drifts in their transmittance values, which allows using the same filters indefinitely. This work not only enables the development of simple cyber-physical photodetectors with high accuracy color-estimation, but also provides a framework for developing similar cyber-physical systems with drastically reduced complexity.

Submitted 29 January, 2019; v1 submitted 27 January, 2019; originally announced January 2019.

arXiv:1811.07461 [pdf, ps, other]

Indoor GeoNet: Weakly Supervised Hybrid Learning for Depth and Pose Estimation

Authors: Amirreza Farnoosh, Sarah Ostadabbas

Abstract: Humans naturally perceive a 3D scene in front of them through accumulation of information obtained from multiple interconnected projections of the scene and by interpreting their correspondence. This phenomenon has inspired artificial intelligence models to extract the depth and view angle of the observed scene by modeling the correspondence between different views of that scene. Our paper is built upon previous works in the field of unsupervised depth and relative camera pose estimation from temporally consecutive video frames using deep learning (DL) models. Our approach uses a hybrid learning framework introduced in a recent work called GeoNet, which leverages geometric constraints in the 3D scenes to synthesize a novel view from intermediate DL-based predicted depth and relative pose. However, the state-of-the-art unsupervised depth and pose estimation DL models are exclusively trained/tested on a few available outdoor scene datasets and we have shown they are hardly transferable to new scenes, especially from indoor environments, in which estimation requires higher precision and dealing with probable occlusions. This paper introduces "Indoor GeoNet", a weakly supervised depth and camera pose estimation model targeted for indoor scenes. In Indoor GeoNet, we take advantage of the availability of indoor RGBD datasets collected by human or robot navigators, and add partial (i.e. weak) supervision in depth training into the model. Experimental results showed that our model effectively generalizes to new scenes from different buildings. Indoor GeoNet demonstrated significant depth and pose estimation error reduction when compared to the original GeoNet, while showing 3 times more reconstruction accuracy in synthesizing novel views in indoor environments.

Submitted 18 November, 2018; originally announced November 2018.

arXiv:1811.07392 [pdf, other]

Facial Expression and Peripheral Physiology Fusion to Decode Individualized Affective Experience

Authors: Yu Yin, Mohsen Nabian, Miolin Fan, ChunAn Chou, Maria Gendron, Sarah Ostadabbas

Abstract: In this paper, we present a multimodal approach to simultaneously analyze facial movements and several peripheral physiological signals to decode individualized affective experiences under positive and negative emotional contexts, while considering their personalized resting dynamics. We propose a person-specific recurrence network to quantify the dynamics present in the person's facial movements and physiological data. Facial movement is represented using a robust head vs. 3D face landmark localization and tracking approach, and physiological data are processed by extracting known attributes related to the underlying affective experience. The dynamical coupling between different input modalities is then assessed through the extraction of several complex recurrent network metrics. Inference models are then trained using these metrics as features to predict individual's affective experience in a given context, after their resting dynamics are excluded from their response. We validated our approach using a multimodal dataset consists of (i) facial videos and (ii) several peripheral physiological signals, synchronously recorded from 12 participants while watching 4 emotion-eliciting video-based stimuli. The affective experience prediction results signified that our multimodal fusion method improves the prediction accuracy up to 19% when compared to the prediction using only one or a subset of the input modalities. Furthermore, we gained prediction improvement for affective experience by considering the effect of individualized resting dynamics.

Submitted 18 November, 2018; originally announced November 2018.

Comments: 2nd IJCAI Workshop on Artificial Intelligence in Affective Computing

arXiv:1808.10721 [pdf, other]

MMDF2018 Workshop Report

Authors: Chun-An Chou, Xiaoning Jin, Amy Mueller, Sarah Ostadabbas

Abstract: Driven by the recent advances in smart, miniaturized, and mass produced sensors, networked systems, and high-speed data communication and computing, the ability to collect and process larger volumes of higher veracity real-time data from a variety of modalities is expanding. However, despite research thrusts explored since the late 1990's, to date no standard, generalizable solutions have emerged for effectively integrating and processing multimodal data, and consequently practitioners across a wide variety of disciplines must still follow a trial-and-error process to identify the optimum procedure for each individual application and data sources. A deeper understanding of the utility and capabilities (as well as the shortcomings and challenges) of existing multimodal data fusion methods as a function of data and challenge characteristics has the potential to deliver better data analysis tools across all sectors, therein enabling more efficient and effective automated manufacturing, patient care, infrastructure maintenance, environmental understanding, transportation networks, energy systems, etc. There is therefore an urgent need to identify the underlying patterns that can be used to determine a priori which techniques will be most useful for any specific dataset or application. This next stage of understanding and discovery (i.e., the development of generalized solutions) can only be achieved via a high level cross-disciplinary aggregation of learnings, and this workshop was proposed at an opportune time as many domains have already started exploring use of multimodal data fusion techniques in a wide range of application-specific contexts.

Submitted 30 August, 2018; originally announced August 2018.

Comments: https://www.northeastern.edu/mmdf2018/wp-content/uploads/2018/08/MMDF_2018_Report.pdf

arXiv:1808.02595 [pdf, other]

A Semi-Supervised Data Augmentation Approach using 3D Graphical Engines

Authors: Shuangjun Liu, Sarah Ostadabbas

Abstract: Deep learning approaches have been rapidly adopted across a wide range of fields because of their accuracy and flexibility, but require large labeled training datasets. This presents a fundamental problem for applications with limited, expensive, or private data (i.e. small data), such as human pose and behavior estimation/tracking which could be highly personalized. In this paper, we present a semi-supervised data augmentation approach that can synthesize large scale labeled training datasets using 3D graphical engines based on a physically-valid low dimensional pose descriptor. To evaluate the performance of our synthesized datasets in training deep learning-based models, we generated a large synthetic human pose dataset, called ScanAva using 3D scans of only 7 individuals based on our proposed augmentation approach. A state-of-the-art human pose estimation deep learning model then was trained from scratch using our ScanAva dataset and could achieve the pose estimation accuracy of 91.2% at PCK0.5 criteria after applying an efficient domain adaptation on the synthetic images, in which its pose estimation accuracy was comparable to the same model trained on large scale pose data from real humans such as MPII dataset and much higher than the model trained on other synthetic human dataset such as SURREAL.

Submitted 28 September, 2018; v1 submitted 7 August, 2018; originally announced August 2018.

arXiv:1808.02104 [pdf, other]

Inner Space Preserving Generative Pose Machine

Authors: Shuangjun Liu, Sarah Ostadabbas

Abstract: Image-based generative methods, such as generative adversarial networks (GANs) have already been able to generate realistic images with much context control, specially when they are conditioned. However, most successful frameworks share a common procedure which performs an image-to-image translation with pose of figures in the image untouched. When the objective is reposing a figure in an image while preserving the rest of the image, the state-of-the-art mainly assumes a single rigid body with simple background and limited pose shift, which can hardly be extended to the images under normal settings. In this paper, we introduce an image "inner space" preserving model that assigns an interpretable low-dimensional pose descriptor (LDPD) to an articulated figure in the image. Figure reposing is then generated by passing the LDPD and the original image through multi-stage augmented hourglass networks in a conditional GAN structure, called inner space preserving generative pose machine (ISP-GPM). We evaluated ISP-GPM on reposing human figures, which are highly articulated with versatile variations. Test of a state-of-the-art pose estimator on our reposed dataset gave an accuracy over 80% on PCK0.5 metric. The results also elucidated that our ISP-GPM is able to preserve the background with high accuracy while reasonably recovering the area blocked by the figure to be reposed.

Submitted 6 August, 2018; originally announced August 2018.

Comments: http://www.northeastern.edu/ostadabbas/2018/07/23/inner-space-preserving-generative-pose-machine/

Journal ref: European Conference on Computer Vision (ECCV2018)

arXiv:1806.09514 [pdf, ps, other]

The Emotional Voices Database: Towards Controlling the Emotion Dimension in Voice Generation Systems

Authors: Adaeze Adigwe, Noé Tits, Kevin El Haddad, Sarah Ostadabbas, Thierry Dutoit

Abstract: In this paper, we present a database of emotional speech intended to be open-sourced and used for synthesis and generation purpose. It contains data for male and female actors in English and a male actor in French. The database covers 5 emotion classes so it could be suitable to build synthesis and voice transformation systems with the potential to control the emotional dimension in a continuous way. We show the data's efficiency by building a simple MLP system converting neutral to angry speech style and evaluate it via a CMOS perception test. Even though the system is a very simple one, the test show the efficiency of the data which is promising for future work.

Submitted 25 June, 2018; originally announced June 2018.

Comments: Submitted to SLSP 2018

arXiv:1711.01218 [pdf, other]

Background Subtraction via Fast Robust Matrix Completion

Authors: Behnaz Rezaei, Sarah Ostadabbas

Abstract: Background subtraction is the primary task of the majority of video inspection systems. The most important part of the background subtraction which is common among different algorithms is background modeling. In this regard, our paper addresses the problem of background modeling in a computationally efficient way, which is important for current eruption of "big data" processing coming from high resolution multi-channel videos. Our model is based on the assumption that background in natural images lies on a low-dimensional subspace. We formulated and solved this problem in a low-rank matrix completion framework. In modeling the background, we benefited from the in-face extended Frank-Wolfe algorithm for solving a defined convex optimization problem. We evaluated our fast robust matrix completion (fRMC) method on both background models challenge (BMC) and Stuttgart artificial background subtraction (SABS) datasets. The results were compared with the robust principle component analysis (RPCA) and low-rank robust matrix completion (RMC) methods, both solved by inexact augmented Lagrangian multiplier (IALM). The results showed faster computation, at least twice as when IALM solver is used, while having a comparable accuracy even better in some challenges, in subtracting the backgrounds in order to detect moving objects in the scene.

Submitted 3 November, 2017; originally announced November 2017.

arXiv:1711.01005 [pdf, other]

doi 10.1109/JTEHM.2019.2892970

In-Bed Pose Estimation: Deep Learning with Shallow Dataset

Authors: Shuangjun Liu, Yu Yin, Sarah Ostadabbas

Abstract: Although human pose estimation for various computer vision (CV) applications has been studied extensively in the last few decades, yet in-bed pose estimation using camera-based vision methods has been ignored by the CV community because it is assumed to be identical to the general purpose pose estimation methods. However, in-bed pose estimation has its own specialized aspects and comes with specific challenges including the notable differences in lighting conditions throughout a day and also having different pose distribution from the common human surveillance viewpoint. In this paper, we demonstrate that these challenges significantly lessen the effectiveness of existing general purpose pose estimation models. In order to address the lighting variation challenge, infrared selective (IRS) image acquisition technique is proposed to provide uniform quality data under various lighting conditions. In addition, to deal with unconventional pose perspective, a 2-end histogram of oriented gradient (HOG) rectification method is presented. In this work, we explored the idea of employing a pre-trained convolutional neural network (CNN) model trained on large public datasets of general human poses and fine-tuning the model using our own shallow in-bed IRS dataset. We developed an IRS imaging system and collected IRS image data from several realistic life-size mannequins in a simulated hospital room environment. A pre-trained CNN called convolutional pose machine (CPM) was repurposed for in-bed pose estimation by fine-tuning its specific intermediate layers. Using the HOG rectification method, the pose estimation performance of CPM significantly improved by 26.4% in PCK0.1 criteria compared to the model without such rectification.

Submitted 7 July, 2018; v1 submitted 2 November, 2017; originally announced November 2017.

Journal ref: IEEE Journal of Translational Engineering in Health and Medicine 2019

arXiv:1606.04165 [pdf, other]

Using Virtual Humans to Understand Real Ones

Authors: Katie Hoemann, Behnaz Rezaei, Stacy C. Marsella, Sarah Ostadabbas

Abstract: Human interactions are characterized by explicit as well as implicit channels of communication. While the explicit channel transmits overt messages, the implicit ones transmit hidden messages about the communicator (e.g., his/her intentions and attitudes). There is a growing consensus that providing a computer with the ability to manipulate implicit affective cues should allow for a more meaningful and natural way of studying particular non-verbal signals of human-human communications by human-computer interactions. In this pilot study, we created a non-dynamic human-computer interaction while manipulating three specific non-verbal channels of communication: gaze pattern, facial expression, and gesture. Participants rated the virtual agent on affective dimensional scales (pleasure, arousal, and dominance) while their physiological signal (electrodermal activity, EDA) was captured during the interaction. Assessment of the behavioral data revealed a significant and complex three-way interaction between gaze, gesture, and facial configuration on the dimension of pleasure, as well as a main effect of gesture on the dimension of dominance. These results suggest a complex relationship between different non-verbal cues and the social context in which they are interpreted. Qualifying considerations as well as possible next steps are further discussed in light of these exploratory findings.

Submitted 13 June, 2016; originally announced June 2016.

arXiv:1606.00370 [pdf, other]

Decoding Emotional Experience through Physiological Signal Processing

Authors: Maria S. Perez-Rosero, Behnaz Rezaei, Murat Akcakaya, Sarah Ostadabbas

Abstract: There is an increasing consensus among researchers that making a computer emotionally intelligent with the ability to decode human affective states would allow a more meaningful and natural way of human-computer interactions (HCIs). One unobtrusive and non-invasive way of recognizing human affective states entails the exploration of how physiological signals vary under different emotional experiences. In particular, this paper explores the correlation between autonomically-mediated changes in multimodal body signals and discrete emotional states. In order to fully exploit the information in each modality, we have provided an innovative classification approach for three specific physiological signals including Electromyogram (EMG), Blood Volume Pressure (BVP) and Galvanic Skin Response (GSR). These signals are analyzed as inputs to an emotion recognition paradigm based on fusion of a series of weak learners. Our proposed classification approach showed 88.1% recognition accuracy, which outperformed the conventional Support Vector Machine (SVM) classifier with 17% accuracy improvement. Furthermore, in order to avoid information redundancy and the resultant over-fitting, a feature reduction method is proposed based on a correlation analysis to optimize the number of features required for training and validating each weak learner. Results showed that despite the feature space dimensionality reduction from 27 to 18 features, our methodology preserved the recognition accuracy of about 85.0%. This reduction in complexity will get us one step closer towards embedding this human emotion encoder in the wireless and wearable HCI platforms.

Submitted 1 June, 2016; originally announced June 2016.

Showing 1–40 of 40 results for author: Ostadabbas, S