Showing 1–27 of 27 results for author: Espy-Wilson, C

  1. arXiv:2406.09706 [pdf, other]

    eess.AS

    A Multimodal Framework for the Assessment of the Schizophrenia Spectrum

    Authors: Gowtham Premananth, Yashish M. Siriwardena, Philip Resnik, Sonia Bansal, Deanna L. Kelly, Carol Espy-Wilson

    Abstract: This paper presents a novel multimodal framework to distinguish between different symptom classes of subjects in the schizophrenia spectrum and healthy controls using audio, video, and text modalities. We implemented Convolution Neural Network and Long Short Term Memory based unimodal models and experimented on various multimodal fusion approaches to come up with the proposed framework. We utilize…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Accepted to be presented at Interspeech 2024

  2. arXiv:2406.05947 [pdf, other]

    eess.AS

    Accent Conversion with Articulatory Representations

    Authors: Yashish M. Siriwardena, Nathan Swedlow, Audrey Howard, Evan Gitterman, Dan Darcy, Carol Espy-Wilson, Andrea Fanelli

    Abstract: Conversion of non-native accented speech to native (American) English has a wide range of applications such as improving intelligibility of non-native speech. Previous work on this domain has used phonetic posteriograms as the target speech representation to train an acoustic model which is then used to extract a compact representation of input speech for accent conversion. In this work, we introd…

    Submitted 9 June, 2024; originally announced June 2024.

    Comments: Accepted at INTERSPEECH 2024

  3. arXiv:2405.13018 [pdf, other]

    cs.CL cs.AI eess.AS

    Continued Pretraining for Domain Adaptation of Wav2vec2.0 in Automatic Speech Recognition for Elementary Math Classroom Settings

    Authors: Ahmed Adel Attia, Dorottya Demszky, Tolulope Ogunremi, Jing Liu, Carol Espy-Wilson

    Abstract: Creating Automatic Speech Recognition (ASR) systems that are robust and resilient to classroom conditions is paramount to the development of AI tools to aid teachers and students. In this work, we study the efficacy of continued pretraining (CPT) in adapting Wav2vec2.0 to the classroom domain. We show that CPT is a powerful tool in that regard and reduces the Word Error Rate (WER) of Wav2vec2.0-ba…

    Submitted 15 May, 2024; originally announced May 2024.

  4. arXiv:2309.15136 [pdf, other]

    eess.SP cs.MM cs.SD eess.AS eess.IV

    A multi-modal approach for identifying schizophrenia using cross-modal attention

    Authors: Gowtham Premananth, Yashish M. Siriwardena, Philip Resnik, Carol Espy-Wilson

    Abstract: This study focuses on how different modalities of human communication can be used to distinguish between healthy controls and subjects with schizophrenia who exhibit strong positive symptoms. We developed a multi-modal schizophrenia classification system using audio, video, and text. Facial action units and vocal tract variables were extracted as low-level features from video and audio respectivel…

    Submitted 18 April, 2024; v1 submitted 26 September, 2023; originally announced September 2023.

    Comments: Accepted to Annual International Conference of the IEEE Engineering in Medicine and Biology Society 2024

  5. arXiv:2309.09220 [pdf, other]

    eess.AS cs.AI cs.SD

    Improving Speech Inversion Through Self-Supervised Embeddings and Enhanced Tract Variables

    Authors: Ahmed Adel Attia, Yashish M. Siriwardena, Carol Espy-Wilson

    Abstract: The performance of deep learning models depends significantly on their capacity to encode input features efficiently and decode them into meaningful outputs. Better input and output representation has the potential to boost models' performance and generalization. In the context of acoustic-to-articulatory speech inversion (SI) systems, we study the impact of utilizing speech representations acquir…

    Submitted 17 September, 2023; originally announced September 2023.

  6. arXiv:2309.07927 [pdf, ps, other]

    eess.AS cs.CL cs.SD

    Kid-Whisper: Towards Bridging the Performance Gap in Automatic Speech Recognition for Children VS. Adults

    Authors: Ahmed Adel Attia, Jing Liu, Wei Ai, Dorottya Demszky, Carol Espy-Wilson

    Abstract: Recent advancements in Automatic Speech Recognition (ASR) systems, exemplified by Whisper, have demonstrated the potential of these systems to approach human-level performance given sufficient data. However, this progress doesn't readily extend to ASR for children due to the limited availability of suitable child-specific databases and the distinct characteristics of children's speech. A recent st…

    Submitted 15 May, 2024; v1 submitted 12 September, 2023; originally announced September 2023.

  7. arXiv:2306.00203 [pdf, ps, other]

    eess.AS

    Speaker-independent Speech Inversion for Estimation of Nasalance

    Authors: Yashish M. Siriwardena, Carol Espy-Wilson, Suzanne Boyce, Mark K. Tiede, Liran Oren

    Abstract: The velopharyngeal (VP) valve regulates the opening between the nasal and oral cavities. This valve opens and closes through a coordinated motion of the velum and pharyngeal walls. Nasalance is an objective measure derived from the oral and nasal acoustic signals that correlate with nasality. In this work, we evaluate the degree to which the nasalance measure reflects fine-grained patterns of VP m…

    Submitted 31 May, 2023; originally announced June 2023.

    Comments: Interspeech 2023

  8. Acoustic-to-Articulatory Speech Inversion Features for Mispronunciation Detection of /r/ in Child Speech Sound Disorders

    Authors: Nina R Benway, Yashish M Siriwardena, Jonathan L Preston, Elaine Hitchcock, Tara McAllister, Carol Espy-Wilson

    Abstract: Acoustic-to-articulatory speech inversion could enhance automated clinical mispronunciation detection to provide detailed articulatory feedback unattainable by formant-based mispronunciation detection algorithms; however, it is unclear the extent to which a speech inversion system trained on adult speech performs in the context of (1) child and (2) clinical speech. In the absence of an articulator…

    Submitted 25 May, 2023; originally announced May 2023.

    Comments: *denotes equal contribution. To appear in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2023

    Journal ref: Proc. INTERSPEECH 2023, 4568-4572

  9. Enhancing Speech Articulation Analysis using a Geometric Transformation of the X-ray Microbeam Dataset

    Authors: Ahmed Adel Attia, Mark Tiede, Carol Y. Espy-Wilson

    Abstract: Accurate analysis of speech articulation is crucial for speech analysis. However, X-Y coordinates of articulators strongly depend on the anatomy of the speakers and the variability of pellet placements, and existing methods for mapping anatomical landmarks in the X-ray Microbeam Dataset (XRMB) fail to capture the entire anatomy of the vocal tract. In this paper, we propose a new geometric transfor…

    Submitted 28 September, 2023; v1 submitted 18 May, 2023; originally announced May 2023.

  10. arXiv:2210.16454 [pdf, ps, other]

    eess.AS cs.SD

    Learning to Compute the Articulatory Representations of Speech with the MIRRORNET

    Authors: Yashish M. Siriwardena, Carol Espy-Wilson, Shihab Shamma

    Abstract: Most organisms including humans function by coordinating and integrating sensory signals with motor actions to survive and accomplish desired tasks. Learning these complex sensorimotor mappings proceeds simultaneously and often in an unsupervised or semi-supervised fashion. An autoencoder architecture (MirrorNet) inspired by this sensorimotor learning paradigm is explored in this work to control a…

    Submitted 25 May, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

    Comments: Interspeech 2023

    Journal ref: Interspeech 2023

  11. arXiv:2210.16450 [pdf, ps, other]

    eess.AS cs.SD

    The Secret Source : Incorporating Source Features to Improve Acoustic-to-Articulatory Speech Inversion

    Authors: Yashish M. Siriwardena, Carol Espy-Wilson

    Abstract: In this work, we incorporated acoustically derived source features, aperiodicity, periodicity and pitch as additional targets to an acoustic-to-articulatory speech inversion (SI) system. We also propose a Temporal Convolution based SI system, which uses auditory spectrograms as the input speech representation, to learn long-range dependencies and complex interactions between the source and vocal t…

    Submitted 28 October, 2022; originally announced October 2022.

  12. Masked Autoencoders Are Articulatory Learners

    Authors: Ahmed Adel Attia, Carol Espy-Wilson

    Abstract: Articulatory recordings track the positions and motion of different articulators along the vocal tract and are widely used to study speech production and to develop speech technologies such as articulatory based speech synthesizers and speech inversion systems. The University of Wisconsin X-Ray microbeam (XRMB) dataset is one of various datasets that provide articulatory recordings synced with aud…

    Submitted 18 May, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

  13. arXiv:2206.09556 [pdf, other]

    eess.AS cs.AI cs.LG cs.SD eess.SP

    An Empirical Analysis on the Vulnerabilities of End-to-End Speech Segregation Models

    Authors: Rahil Parikh, Gaspar Rochette, Carol Espy-Wilson, Shihab Shamma

    Abstract: End-to-end learning models have demonstrated a remarkable capability in performing speech segregation. Despite their wide-scope of real-world applications, little is known about the mechanisms they employ to group and consequently segregate individual speakers. Knowing that harmonicity is a critical cue for these networks to group sources, in this work, we perform a thorough investigation on ConvT…

    Submitted 19 June, 2022; originally announced June 2022.

    Comments: Accepted at Interspeech 2022

  14. Acoustic-to-articulatory Speech Inversion with Multi-task Learning

    Authors: Yashish M. Siriwardena, Ganesh Sivaraman, Carol Espy-Wilson

    Abstract: Multi-task learning (MTL) frameworks have proven to be effective in diverse speech related tasks like automatic speech recognition (ASR) and speech emotion recognition. This paper proposes a MTL framework to perform acoustic-to-articulatory speech inversion by simultaneously learning an acoustic to phoneme mapping as a shared task. We use the Haskins Production Rate Comparison (HPRC) database whic…

    Submitted 26 May, 2022; originally announced May 2022.

    Journal ref: Proc. Interspeech 2022

  15. arXiv:2205.13086 [pdf, ps, other]

    eess.AS

    Audio Data Augmentation for Acoustic-to-articulatory Speech Inversion using Bidirectional Gated RNNs

    Authors: Yashish M. Siriwardena, Ahmed Adel Attia, Ganesh Sivaraman, Carol Espy-Wilson

    Abstract: Data augmentation has proven to be a promising prospect in improving the performance of deep learning models by adding variability to training data. In previous work with developing a noise robust acoustic-to-articulatory speech inversion system, we have shown the importance of noise augmentation to improve the performance of speech inversion in noisy speech. In this work, we compare and contrast…

    Submitted 31 May, 2023; v1 submitted 25 May, 2022; originally announced May 2022.

    Comments: EUSIPCO 2023

  16. arXiv:2203.05780 [pdf, other]

    eess.AS cs.AI cs.LG cs.SD eess.SP

    Acoustic To Articulatory Speech Inversion Using Multi-Resolution Spectro-Temporal Representations Of Speech Signals

    Authors: Rahil Parikh, Nadee Seneviratne, Ganesh Sivaraman, Shihab Shamma, Carol Espy-Wilson

    Abstract: Multi-resolution spectro-temporal features of a speech signal represent how the brain perceives sounds by tuning cortical cells to different spectral and temporal modulations. These features produce a higher dimensional representation of the speech signals. The purpose of this paper is to evaluate how well the auditory cortex representation of speech signals contribute to estimate articulatory fea…

    Submitted 25 June, 2022; v1 submitted 11 March, 2022; originally announced March 2022.

    Comments: Accepted at ISCA Interspeech 2022

  17. arXiv:2203.04420 [pdf, other]

    eess.AS cs.AI cs.LG cs.SD eess.SP

    Harmonicity Plays a Critical Role in DNN Based Versus in Biologically-Inspired Monaural Speech Segregation Systems

    Authors: Rahil Parikh, Ilya Kavalerov, Carol Espy-Wilson, Shihab Shamma

    Abstract: Recent advancements in deep learning have led to drastic improvements in speech segregation models. Despite their success and growing applicability, few efforts have been made to analyze the underlying principles that these networks learn to perform segregation. Here we analyze the role of harmonicity on two state-of-the-art Deep Neural Networks (DNN)-based models- Conv-TasNet and DPT-Net. We eval…

    Submitted 8 March, 2022; originally announced March 2022.

    Comments: 5 pages, IEEE International Conference on Acoustics, Speech, & Signal Processing (ICASSP), 2022

  18. arXiv:2202.06238 [pdf, other]

    eess.AS cs.CL

    Multimodal Depression Classification Using Articulatory Coordination Features And Hierarchical Attention Based Text Embeddings

    Authors: Nadee Seneviratne, Carol Espy-Wilson

    Abstract: Multimodal depression classification has gained immense popularity over the recent years. We develop a multimodal depression classification system using articulatory coordination features extracted from vocal tract variables and text transcriptions obtained from an automatic speech recognition tool that yields improvements of area under the receiver operating characteristics curve compared to uni-…

    Submitted 13 February, 2022; originally announced February 2022.

    Comments: Accepted to ICASSP 2022. arXiv admin note: text overlap with arXiv:2104.04195

  19. arXiv:2110.04440 [pdf, other]

    eess.AS cs.MM cs.SD

    Multimodal Approach for Assessing Neuromotor Coordination in Schizophrenia Using Convolutional Neural Networks

    Authors: Yashish M. Siriwardena, Chris Kitchen, Deanna L. Kelly, Carol Espy-Wilson

    Abstract: This study investigates the speech articulatory coordination in schizophrenia subjects exhibiting strong positive symptoms (e.g. hallucinations and delusions), using two distinct channel-delay correlation methods. We show that the schizophrenic subjects with strong positive symptoms and who are markedly ill pose complex articulatory coordination pattern in facial and speech gestures than what is o…

    Submitted 8 October, 2021; originally announced October 2021.

    Comments: 5 pages. arXiv admin note: text overlap with arXiv:2102.07054

    Journal ref: Proceedings of the 2021 International Conference on Multimodal Interaction

  20. arXiv:2104.04195 [pdf, other]

    eess.AS cs.LG

    Speech based Depression Severity Level Classification Using a Multi-Stage Dilated CNN-LSTM Model

    Authors: Nadee Seneviratne, Carol Espy-Wilson

    Abstract: Speech based depression classification has gained immense popularity over the recent years. However, most of the classification studies have focused on binary classification to distinguish depressed subjects from non-depressed subjects. In this paper, we formulate the depression classification task as a severity level classification problem to provide more granularity to the classification outcome…

    Submitted 9 April, 2021; originally announced April 2021.

    Comments: 5 pages, submitted to Interspeech 2021. arXiv admin note: text overlap with arXiv:2011.06739

  21. arXiv:2102.07054 [pdf, other]

    eess.AS

    Inverted Vocal Tract Variables and Facial Action Units to Quantify Neuromotor Coordination in Schizophrenia

    Authors: Yashish Maduwantha H. P. E. R. S, Chris Kitchen, Deanna L. Kelly, Carol Espy-Wilson

    Abstract: This study investigates the speech articulatory coordination in schizophrenia subjects exhibiting strong positive symptoms (e.g. hallucinations and delusions), using a time delay embedded correlation analysis. We show that the schizophrenia subjects with strong positive symptoms and who are markedly ill pose complex coordination patterns in facial and speech gestures than what is observed in health…

    Submitted 13 February, 2021; originally announced February 2021.

    Comments: Conference

  22. arXiv:2011.06739 [pdf, other]

    eess.AS cs.LG

    Generalized Dilated CNN Models for Depression Detection Using Inverted Vocal Tract Variables

    Authors: Nadee Seneviratne, Carol Espy-Wilson

    Abstract: Depression detection using vocal biomarkers is a highly researched area. Articulatory coordination features (ACFs) are developed based on the changes in neuromotor coordination due to psychomotor slowing, a key feature of Major Depressive Disorder. However findings of existing studies are mostly validated on a single database which limits the generalizability of results. Variability across differe…

    Submitted 9 April, 2021; v1 submitted 12 November, 2020; originally announced November 2020.

    Comments: 5 pages, Submitted to Interspeech 2021

  23. Spoken Language Interaction with Robots: Research Issues and Recommendations, Report from the NSF Future Directions Workshop

    Authors: Matthew Marge, Carol Espy-Wilson, Nigel Ward

    Abstract: With robotics rapidly advancing, more effective human-robot interaction is increasingly needed to realize the full potential of robots for society. While spoken language must be part of the solution, our ability to provide spoken language interaction capabilities is still very limited. The National Science Foundation accordingly convened a workshop, bringing together speech, language, and robotics…

    Submitted 10 November, 2020; originally announced November 2020.

    Comments: 35 pages, 6 figures, Final report from the NSF Future Directions Workshop on Speech for Robotics, held in October 2019, College Park, MD. Workshop website: https://isr.umd.edu/2019-SFRW

  24. arXiv:1911.00030 [pdf, other]

    cs.LG cs.SD eess.AS stat.ML

    Modeling Feature Representations for Affective Speech using Generative Adversarial Networks

    Authors: Saurabh Sahu, Rahul Gupta, Carol Espy-Wilson

    Abstract: Emotion recognition is a classic field of research with a typical setup extracting features and feeding them through a classifier for prediction. On the other hand, generative models jointly capture the distributional relationship between emotions and the feature profiles. Relatively recently, Generative Adversarial Networks (GANs) have surfaced as a new class of generative models and have shown c…

    Submitted 31 October, 2019; originally announced November 2019.

    Journal ref: TAFFC-2019-08-0222.R2

  25. arXiv:1806.06626 [pdf, other]

    cs.CL

    On Enhancing Speech Emotion Recognition using Generative Adversarial Networks

    Authors: Saurabh Sahu, Rahul Gupta, Carol Espy-Wilson

    Abstract: Generative Adversarial Networks (GANs) have gained a lot of attention from machine learning community due to their ability to learn and mimic an input data distribution. GANs consist of a discriminator and a generator working in tandem playing a min-max game to learn a target underlying data distribution; when fed with data-points sampled from a simpler distribution (like uniform or Gaussian distr…

    Submitted 18 June, 2018; originally announced June 2018.

    Comments: 5 pages, Accepted to Interspeech, Hyderabad-2018

  26. arXiv:1806.02863 [pdf, other]

    cs.IR cs.CL cs.LG stat.ML

    Semi-supervised and Transfer learning approaches for low resource sentiment classification

    Authors: Rahul Gupta, Saurabh Sahu, Carol Espy-Wilson, Shrikanth Narayanan

    Abstract: Sentiment classification involves quantifying the affective reaction of a human to a document, media item or an event. Although researchers have investigated several methods to reliably infer sentiment from lexical, speech and body language cues, training a model with a small set of labeled datasets is still a challenge. For instance, in expanding sentiment analysis to new languages and cultures,…

    Submitted 7 June, 2018; originally announced June 2018.

    Comments: 5 pages, Accepted to International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018

  27. arXiv:1806.02146 [pdf, other]

    stat.ML cs.LG

    Adversarial Auto-encoders for Speech Based Emotion Recognition

    Authors: Saurabh Sahu, Rahul Gupta, Ganesh Sivaraman, Wael AbdAlmageed, Carol Espy-Wilson

    Abstract: Recently, generative adversarial networks and adversarial autoencoders have gained a lot of attention in machine learning community due to their exceptional performance in tasks such as digit classification and face recognition. They map the autoencoder's bottleneck layer output (termed as code vectors) to different noise Probability Distribution Functions (PDFs), that can be further regularized t…

    Submitted 6 June, 2018; originally announced June 2018.

    Comments: 5 pages, INTERSPEECH 2017 August 20-24, 2017, Stockholm, Sweden