Search | arXiv e-print repository

Towards Open Domain Text-Driven Synthesis of Multi-Person Motions

Authors: Mengyi Shan, Lu Dong, Yutao Han, Yuan Yao, Tao Liu, Ifeoma Nwogu, Guo-Jun Qi, Mitch Hill

Abstract: This work aims to generate natural and diverse group motions of multiple humans from textual descriptions. While single-person text-to-motion generation is extensively studied, it remains challenging to synthesize motions for more than one or two subjects from in-the-wild prompts, mainly due to the lack of available datasets. In this work, we curate human pose and motion datasets by estimating pose information from large-scale image and video datasets. Our models use a transformer-based diffusion framework that accommodates multiple datasets with any number of subjects or frames. Experiments explore both generation of multi-person static poses and generation of multi-person motion sequences. To our knowledge, our method is the first to generate multi-subject motion sequences with high diversity and fidelity from a large variety of textual prompts.

Submitted 28 May, 2024; originally announced May 2024.

Comments: Project page: https://shanmy.github.io/Multi-Motion/

arXiv:2405.07974 [pdf, other]

SignAvatar: Sign Language 3D Motion Reconstruction and Generation

Authors: Lu Dong, Lipisha Chaudhary, Fei Xu, Xiao Wang, Mason Lary, Ifeoma Nwogu

Abstract: Achieving expressive 3D motion reconstruction and automatic generation for isolated sign words can be challenging, due to the lack of real-world 3D sign-word data, the complex nuances of signing motions, and the cross-modal understanding of sign language semantics. To address these challenges, we introduce SignAvatar, a framework capable of both word-level sign language reconstruction and generation. SignAvatar employs a transformer-based conditional variational autoencoder architecture, effectively establishing relationships across different semantic modalities. Additionally, this approach incorporates a curriculum learning strategy to enhance the model's robustness and generalization, resulting in more realistic motions. Furthermore, we contribute the ASL3DWord dataset, composed of 3D joint rotation data for the body, hands, and face, for unique sign words. We demonstrate the effectiveness of SignAvatar through extensive experiments, showcasing its superior reconstruction and automatic generation capabilities. The code and dataset are available on the project page.

Submitted 13 May, 2024; originally announced May 2024.

Comments: Accepted by FG2024

arXiv:2308.09611 [pdf, other]

Language-guided Human Motion Synthesis with Atomic Actions

Authors: Yuanhao Zhai, Mingzhen Huang, Tianyu Luan, Lu Dong, Ifeoma Nwogu, Siwei Lyu, David Doermann, Junsong Yuan

Abstract: Language-guided human motion synthesis has been a challenging task due to the inherent complexity and diversity of human behaviors. Previous methods face limitations in generalization to novel actions, often resulting in unrealistic or incoherent motion sequences. In this paper, we propose ATOM (ATomic mOtion Modeling) to mitigate this problem, by decomposing actions into atomic actions, and employing a curriculum learning strategy to learn atomic action composition. First, we disentangle complex human motions into a set of atomic actions during learning, and then assemble novel actions using the learned atomic actions, which offers better adaptability to new actions. Moreover, we introduce a curriculum learning training strategy that leverages masked motion modeling with a gradual increase in the mask ratio, and thus facilitates atomic action assembly. This approach mitigates the overfitting problem commonly encountered in previous methods while enforcing the model to learn better motion representations. We demonstrate the effectiveness of ATOM through extensive experiments, including text-to-motion and action-to-motion synthesis tasks. We further illustrate its superiority in synthesizing plausible and coherent text-guided human motion sequences.

Submitted 18 August, 2023; originally announced August 2023.

Comments: Accepted to ACM MM 2023, code: https://github.com/yhZhai/ATOM
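
The ATOM abstract above mentions masked motion modeling with a gradual increase in the mask ratio. The abstract does not give the schedule, so the following is only a minimal sketch of one plausible choice, a linear ramp. The function name `mask_ratio` and the `start` and `end` values are hypothetical and are not taken from the paper or its code.

```python
def mask_ratio(epoch: int, total_epochs: int,
               start: float = 0.1, end: float = 0.6) -> float:
    """Linearly ramp the fraction of masked motion frames over training.

    Assumption: a linear schedule with illustrative endpoints. The
    abstract only states that the mask ratio increases gradually.
    """
    # Fraction of training completed, clamped to [0, 1].
    progress = epoch / max(total_epochs - 1, 1)
    progress = min(max(progress, 0.0), 1.0)
    return start + (end - start) * progress


# Mask ratio at each epoch of a hypothetical 10-epoch run.
schedule = [mask_ratio(e, 10) for e in range(10)]
```

Starting with a low ratio makes the early reconstruction task easy; raising it later forces the model to fill in longer spans, which is the usual motivation for a curriculum of this kind.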

arXiv:2212.02848 [pdf, other]

SignNet: Single Channel Sign Generation using Metric Embedded Learning

Authors: Tejaswini Ananthanarayana, Lipisha Chaudhary, Ifeoma Nwogu

Abstract: A true interpreting agent not only understands sign language and translates to text, but also understands text and translates to signs. Much of the AI work in sign language translation to date has focused mainly on translating from signs to text. Towards the latter goal, we propose a text-to-sign translation model, SignNet, which exploits the notion of similarity (and dissimilarity) of visual signs in translating. The module presented is only one part of a dual-learning, two-task process involving text-to-sign (T2S) as well as sign-to-text (S2T). We currently implement SignNet as a single channel architecture so that the output of the T2S task can be fed into S2T in a continuous dual learning framework. By single channel, we refer to a single modality, the body pose joints. In this work, we present SignNet, a T2S task using a novel metric embedding learning process, to preserve the distances between sign embeddings relative to their dissimilarity. We also describe how to choose positive and negative examples of signs for similarity testing. From our analysis, we observe that metric embedding learning-based models perform significantly better than the other models with traditional losses, when evaluated using BLEU scores. In the task of gloss to pose, SignNet performed as well as its state-of-the-art (SoTA) counterparts and outperformed them in the task of text to pose, by showing noteworthy enhancements in BLEU 1 - BLEU 4 scores (BLEU 1: 31->39, ~26% improvement; BLEU 4: 10.43->11.84, ~14% improvement) when tested on the popular RWTH PHOENIX-Weather-2014T benchmark dataset.

Submitted 6 December, 2022; originally announced December 2022.

Comments: 9 pages, 4 figures, 4 tables - IEEE Face and Gestures, 2023
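
The improvement percentages quoted in the SignNet abstract are relative gains over the baseline scores. As a quick arithmetic check, this sketch recomputes them from the four BLEU values given in the abstract; the helper name `relative_improvement` is introduced here for illustration only.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Return the relative change from baseline to new, in percent."""
    return (new - baseline) / baseline * 100.0


# BLEU values as quoted in the abstract (text-to-pose task).
bleu1_gain = relative_improvement(31.0, 39.0)
bleu4_gain = relative_improvement(10.43, 11.84)

print(f"BLEU-1 relative gain: {bleu1_gain:.1f}%")  # prints 25.8%
print(f"BLEU-4 relative gain: {bleu4_gain:.1f}%")  # prints 13.5%
```

Both results round to the approximate figures in the abstract (about 26% and about 14%).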

arXiv:2207.04566 [pdf, other]

A Probabilistic Model Of Interaction Dynamics for Dyadic Face-to-Face Settings

Authors: Renke Wang, Ifeoma Nwogu

Abstract: Natural conversations between humans often involve a large number of non-verbal nuanced expressions, displayed at key times throughout the conversation. Understanding and being able to model these complex interactions is essential for creating realistic human-agent communication, whether in the virtual or physical world. As social robots and intelligent avatars emerge in popularity and utility, being able to realistically model and generate these dynamic expressions throughout conversations is critical. We develop a probabilistic model to capture the interaction dynamics between pairs of participants in a face-to-face setting, allowing for the encoding of synchronous expressions between the interlocutors. This interaction encoding is then used to influence the generation when predicting one agent's future dynamics, conditioned on the other's current dynamics. FLAME features are extracted from videos containing natural conversations between subjects to train our interaction model. We assess the efficacy of our proposed model via quantitative and qualitative metrics, and show that it successfully captures the dynamics of interacting dyads. We also test the model with a never-before-seen parent-infant dataset comprising two different modes of communication between the dyads, and show that our model successfully distinguishes between the modes based on their interaction dynamics.

Submitted 10 July, 2022; originally announced July 2022.

arXiv:2206.01820 [pdf, other]

A Robust Backpropagation-Free Framework for Images

Authors: Timothy Zee, Alexander G. Ororbia, Ankur Mali, Ifeoma Nwogu

Abstract: While current deep learning algorithms have been successful for a wide variety of artificial intelligence (AI) tasks, including those involving structured image data, they present deep neurophysiological conceptual issues due to their reliance on the gradients that are computed by backpropagation of errors (backprop). Gradients are required to obtain synaptic weight adjustments but require knowledge of feed-forward activities in order to conduct backward propagation, a biologically implausible process. This is known as the "weight transport problem". Therefore, in this work, we present a more biologically plausible approach towards solving the weight transport problem for image data. This approach, which we name the error kernel driven activation alignment (EKDAA) algorithm, accomplishes this through the introduction of locally derived error transmission kernels and error maps. Like standard deep learning networks, EKDAA performs the standard forward process via weights and activation functions; however, its backward error computation involves adaptive error kernels that propagate local error signals through the network. The efficacy of EKDAA is demonstrated by performing visual-recognition tasks on the Fashion MNIST, CIFAR-10 and SVHN benchmarks, along with demonstrating its ability to extract visual features from natural color images. Furthermore, in order to demonstrate its non-reliance on gradient computations, results are presented for an EKDAA-trained CNN that employs a non-differentiable activation function.

Submitted 5 November, 2023; v1 submitted 3 June, 2022; originally announced June 2022.

arXiv:2011.00559 [pdf, other]

WLV-RIT at HASOC-Dravidian-CodeMix-FIRE2020: Offensive Language Identification in Code-switched YouTube Comments

Authors: Tharindu Ranasinghe, Sarthak Gupte, Marcos Zampieri, Ifeoma Nwogu

Abstract: This paper describes the WLV-RIT entry to the Hate Speech and Offensive Content Identification in Indo-European Languages (HASOC) shared task 2020. The HASOC 2020 organizers provided participants with annotated datasets containing code-mixed social media posts in Dravidian languages (Malayalam-English and Tamil-English). We participated in task 1: Offensive comment identification in Code-mixed Malayalam YouTube comments. In our methodology, we take advantage of available English data by applying cross-lingual contextual word embeddings and transfer learning to make predictions on Malayalam data. We further improve the results using various fine-tuning strategies. Our system achieved a 0.89 weighted average F1 score on the test set and ranked 5th out of 12 participants.

Submitted 1 November, 2020; originally announced November 2020.

Comments: Accepted to FIRE 2020

arXiv:2009.01468 [pdf, other]

Modeling Global Body Configurations in American Sign Language

Authors: Nicholas Wilkins, Beck Cordes Galbraith, Ifeoma Nwogu

Abstract: American Sign Language (ASL) is the fourth most commonly used language in the United States and is the language most commonly used by Deaf people in the United States and the English-speaking regions of Canada. Unfortunately, until recently, ASL received little research attention. This is due, in part, to its delayed recognition as a language until William C. Stokoe's publication in 1960. Limited data has been a long-standing obstacle to ASL research and computational modeling. The lack of large-scale datasets has prohibited many modern machine-learning techniques, such as Neural Machine Translation, from being applied to ASL. In addition, the modality required to capture sign language (i.e. video) is complex in natural settings (as one must deal with background noise, motion blur, and the curse of dimensionality). Finally, when compared with spoken languages, such as English, there has been limited research conducted into the linguistics of ASL. We realize a simplified version of Liddell and Johnson's Movement-Hold (MH) Model using a Probabilistic Graphical Model (PGM). We trained our model on ASLing, a dataset collected from three fluent ASL signers. We evaluate our PGM against other models to determine its ability to model ASL. Finally, we interpret various aspects of the PGM and draw conclusions about ASL phonetics. The main contributions of this paper are

Submitted 3 September, 2020; originally announced September 2020.

arXiv:1912.02163 [pdf, other]

Regression with Uncertainty Quantification in Large Scale Complex Data

Authors: Nicholas Wilkins, Michael Johnson, Ifeoma Nwogu

Abstract: While several methods for predicting uncertainty on deep networks have been recently proposed, they do not readily translate to large and complex datasets. In this paper we utilize a simplified form of the Mixture Density Networks (MDNs) to produce a one-shot approach to quantify uncertainty in regression problems. We show that our uncertainty bounds are on par with or better than other reported existing methods. When applied to standard regression benchmark datasets, we show an improvement in predictive log-likelihood and root-mean-square-error when compared to existing state-of-the-art methods. We also demonstrate this method's efficacy on stochastic, highly volatile time-series data where stock prices are predicted for the next time interval. The resulting uncertainty graph summarizes significant anomalies in the stock price chart. Furthermore, we apply this method to the task of age estimation from the challenging IMDb-Wiki dataset of half a million face images. We successfully predict the uncertainties associated with the prediction and empirically analyze the underlying causes of the uncertainties. This uncertainty quantification can be used to pre-process low quality datasets and further enable learning.

Submitted 4 December, 2019; originally announced December 2019.
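
The abstract above describes a simplified Mixture Density Network that yields an uncertainty estimate in a single forward pass. It does not spell out the simplification, so the sketch below assumes the common single-Gaussian case: the network outputs a mean and a log-variance per target, and training minimizes the Gaussian negative log-likelihood. The function name and the example values are illustrative and not taken from the paper.

```python
import math


def gaussian_nll(y: float, mean: float, log_var: float) -> float:
    """Negative log-likelihood of y under N(mean, exp(log_var)).

    Assumption: a one-component (single Gaussian) density head.
    Predicting log-variance keeps the variance positive without
    needing an explicit constraint.
    """
    var = math.exp(log_var)
    return 0.5 * (math.log(2.0 * math.pi) + log_var + (y - mean) ** 2 / var)


# Same prediction error of 3.0, with a confident and an uncertain variance.
confident = gaussian_nll(y=3.0, mean=0.0, log_var=0.0)            # variance 1
uncertain = gaussian_nll(y=3.0, mean=0.0, log_var=math.log(9.0))  # variance 9
```

With this loss, a large error is penalized less when the model also predicts a large variance, which is what lets the predicted variance act as a learned uncertainty estimate.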

arXiv:1807.06124 [pdf, other]

Computational Social Dynamics: Analyzing the Face-level Interactions in a Group

Authors: Nicholas Watkins, Ifeoma Nwogu

Abstract: Interactional synchrony refers to how the speech or behavior of two or more people involved in a conversation becomes more finely synchronized with each other, and they can appear to behave almost in direct response to one another. Studies have shown that interactional synchrony is a hallmark of relationships, and is produced as a result of rapport. Research has also shown that up to two-thirds of human communication occurs via nonverbal channels such as gestures (or body movements), facial expressions, etc. In this work, we use computer vision based methods to extract nonverbal cues, specifically from the face, and develop a model to measure interactional synchrony based on those cues. This paper illustrates a novel method of constructing a dynamic deep neural architecture, specifically made up of intermediary long short-term memory networks (LSTMs), useful for learning and predicting the extent of synchrony between two or more processes, by emulating the nonlinear dependencies between them. On a synthetic dataset, where pairs of sequences were generated from a Gaussian process with known covariates, the architecture could successfully determine the covariance values of the generating process within an error of 0.5% when tested on 100 pairs of interacting signals. On a real-life dataset involving groups of three people, the model successfully estimated the extent of synchrony of each group on a scale of 1 to 5, with an overall mean prediction error of 2.96% when performing 5-fold validation, as compared to 26.1% on the random permutations serving as the control baseline.

Submitted 16 July, 2018; originally announced July 2018.

arXiv:1412.2404 [pdf, other]

Dimensionality Reduction with Subspace Structure Preservation

Authors: Devansh Arpit, Ifeoma Nwogu, Venu Govindaraju

Abstract: Modeling data as being sampled from a union of independent subspaces has been widely applied to a number of real world applications. However, dimensionality reduction approaches that theoretically preserve this independence assumption have not been well studied. Our key contribution is to show that $2K$ projection vectors are sufficient for the independence preservation of any $K$ class data sampled from a union of independent subspaces. It is this non-trivial observation that we use for designing our dimensionality reduction technique. In this paper, we propose a novel dimensionality reduction algorithm that theoretically preserves this structure for a given dataset. We support our theoretical analysis with empirical results on both synthetic and real world data achieving state-of-the-art results compared to popular dimensionality reduction techniques.

Submitted 6 April, 2016; v1 submitted 7 December, 2014; originally announced December 2014.

Comments: Published in NIPS 2014; v2: minor updates to the algorithm and added a few lines addressing application to large-scale/high-dimensional data

arXiv:1409.6745 [pdf, other]

A Concept Learning Approach to Multisensory Object Perception

Authors: Ifeoma Nwogu, Goker Erdogan, Ilker Yildirim, Robert Jacobs

Abstract: This paper presents a computational model of concept learning using Bayesian inference for a grammatically structured hypothesis space, and tests the model on multisensory (visual and haptic) recognition of 3D objects. The study is performed on a set of artificially generated 3D objects known as fribbles, which are complex, multipart objects with categorical structures. The goal of this work is to develop a working multisensory representational model that integrates major themes on concepts and concept learning from the cognitive science literature. The model combines the representational power of a probabilistic generative grammar with the inferential power of Bayesian induction.

Submitted 23 September, 2014; originally announced September 2014.

Comments: 6 pages and 6 figures

arXiv:1405.1380 [pdf, other]

Is Joint Training Better for Deep Auto-Encoders?

Authors: Yingbo Zhou, Devansh Arpit, Ifeoma Nwogu, Venu Govindaraju

Abstract: Traditionally, when generative models of data are developed via deep architectures, greedy layer-wise pre-training is employed. In a well-trained model, the lower layer of the architecture models the data distribution conditional upon the hidden variables, while the higher layers model the hidden distribution prior. But due to the greedy scheme of the layer-wise training technique, the parameters of lower layers are fixed when training higher layers. This makes it extremely challenging for the model to learn the hidden distribution prior, which in turn leads to a suboptimal model for the data distribution. We therefore investigate joint training of deep autoencoders, where the architecture is viewed as one stack of two or more single-layer autoencoders. A single global reconstruction objective is jointly optimized, such that the objective for the single autoencoders at each layer acts as a local, layer-level regularizer. We empirically evaluate the performance of this joint training scheme and observe that it not only learns a better data model, but also learns better higher layer representations, which highlights its potential for unsupervised feature learning. In addition, we find that the use of regularization in the joint training scheme is crucial in achieving good performance. In the supervised setting, joint training also shows superior performance when training deeper models. The joint training framework can thus provide a platform for investigating more efficient usage of different types of regularizers, especially in light of the growing volumes of available unlabeled data.

Submitted 15 June, 2015; v1 submitted 6 May, 2014; originally announced May 2014.

Comments: 11 pages, 4 figures

arXiv:1401.4489 [pdf, other]

An Analysis of Random Projections in Cancelable Biometrics

Authors: Devansh Arpit, Ifeoma Nwogu, Gaurav Srivastava, Venu Govindaraju

Abstract: With increasing concerns about security, the need for highly secure physical biometrics-based authentication systems utilizing cancelable biometric technologies is on the rise. Because the problem of cancelable template generation deals with the trade-off between template security and matching performance, many state-of-the-art algorithms successful in generating high quality cancelable biometrics all have random projection as one of their early processing steps. This paper therefore presents a formal analysis of why random projection is an essential step in cancelable biometrics. By formally defining the notion of an Independent Subspace Structure for datasets, it can be shown that random projection preserves the subspace structure of data vectors generated from a union of independent linear subspaces. The bound on the minimum number of random vectors required for this to hold is also derived and is shown to depend logarithmically on the number of data samples, not only in independent subspaces but in disjoint subspace settings as well. The theoretical analysis presented is supported in detail with empirical results on real-world face recognition datasets.

Submitted 13 November, 2014; v1 submitted 17 January, 2014; originally announced January 2014.

Showing 1–14 of 14 results for author: Nwogu, I