-
Can you Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features?
Authors:
Zakaria Aldeneh,
Takuya Higuchi,
Jee-weon Jung,
Skyler Seto,
Tatiana Likhomanenko,
Stephen Shum,
Ahmed Hussen Abdelaziz,
Shinji Watanabe,
Barry-John Theobald
Abstract:
Self-supervised features are typically used in place of filter-bank features in speaker verification models. However, these models were originally designed to ingest filter-bank features as inputs, and thus, training them on top of self-supervised features assumes that both feature types require the same amount of learning for the task. In this work, we observe that pre-trained self-supervised speech features inherently include the information required for the downstream speaker verification task, and therefore, we can simplify the downstream model without sacrificing performance. To this end, we revisit the design of the downstream model for speaker verification using self-supervised features. We show that we can simplify the model to use 97.51% fewer parameters while achieving a 29.93% average improvement in performance on SUPERB. Consequently, we show that the simplified downstream model is more data efficient compared to the baseline: it achieves better performance with only 60% of the training data.
Submitted 13 June, 2024; v1 submitted 1 February, 2024;
originally announced February 2024.
-
ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models
Authors:
Jee-weon Jung,
Wangyou Zhang,
Jiatong Shi,
Zakaria Aldeneh,
Takuya Higuchi,
Barry-John Theobald,
Ahmed Hussen Abdelaziz,
Shinji Watanabe
Abstract:
This paper introduces ESPnet-SPK, a toolkit designed with several objectives for training speaker embedding extractors. First, we provide an open-source platform for researchers in the speaker recognition community to effortlessly build models. We provide several models, ranging from x-vector to the recent SKA-TDNN. Through the modularized architecture design, variants can be developed easily. We also aspire to bridge developed models with other domains, enabling the broader research community to effortlessly incorporate state-of-the-art embedding extractors. Pre-trained embedding extractors can be accessed in an off-the-shelf manner and we demonstrate the toolkit's versatility by showcasing its integration with two tasks. Another goal is to integrate with diverse self-supervised learning features. We release a reproducible recipe that achieves an equal error rate of 0.39% on the Vox1-O evaluation protocol using WavLM-Large with ECAPA-TDNN.
Submitted 13 June, 2024; v1 submitted 30 January, 2024;
originally announced January 2024.
-
Spatial LibriSpeech: An Augmented Dataset for Spatial Audio Learning
Authors:
Miguel Sarabia,
Elena Menyaylenko,
Alessandro Toso,
Skyler Seto,
Zakaria Aldeneh,
Shadi Pirhosseinloo,
Luca Zappella,
Barry-John Theobald,
Nicholas Apostoloff,
Jonathan Sheaffer
Abstract:
We present Spatial LibriSpeech, a spatial audio dataset with over 650 hours of 19-channel audio, first-order ambisonics, and optional distractor noise. Spatial LibriSpeech is designed for machine learning model training, and it includes labels for source position, speaking direction, room acoustics and geometry. Spatial LibriSpeech is generated by augmenting LibriSpeech samples with 200k+ simulated acoustic conditions across 8k+ synthetic rooms. To demonstrate the utility of our dataset, we train models on four spatial audio tasks, resulting in a median absolute error of 6.60° on 3D source localization, 0.43 m on distance, 90.66 ms on T30, and 2.74 dB on DRR estimation. We show that the same models generalize well to widely-used evaluation datasets, e.g., obtaining a median absolute error of 12.43° on 3D source localization on TUT Sound Events 2018, and 157.32 ms on T30 estimation on ACE Challenge.
Submitted 18 August, 2023;
originally announced August 2023.
-
Naturalistic Head Motion Generation from Speech
Authors:
Trisha Mittal,
Zakaria Aldeneh,
Masha Fedzechkina,
Anurag Ranjan,
Barry-John Theobald
Abstract:
Synthesizing natural head motion to accompany speech for an embodied conversational agent is necessary for providing a rich interactive experience. Most prior works assess the quality of generated head motion by comparing it against a single ground truth using an objective metric. Yet there are many plausible head motion sequences to accompany a speech utterance. In this work, we study the variation in the perceptual quality of head motions sampled from a generative model. We show that, despite providing more diverse head motions, the generative model produces motions with varying degrees of perceptual quality. We finally show that objective metrics commonly used in previous research do not accurately reflect the perceptual quality of generated head motions. These results open an interesting avenue for future work to investigate better objective metrics that correlate with human perception of quality.
Submitted 26 October, 2022;
originally announced October 2022.
-
On the role of Lip Articulation in Visual Speech Perception
Authors:
Zakaria Aldeneh,
Masha Fedzechkina,
Skyler Seto,
Katherine Metcalf,
Miguel Sarabia,
Nicholas Apostoloff,
Barry-John Theobald
Abstract:
Generating realistic lip motion from audio to simulate speech production is critical for driving natural character animation. Previous research has shown that traditional metrics used to optimize and assess models for generating lip motion from speech are not a good indicator of subjective opinion of animation quality. Devising metrics that align with subjective opinion first requires understanding what impacts human perception of quality. In this work, we focus on the degree of articulation and run a series of experiments to study how articulation strength impacts human perception of lip motion accompanying speech. Specifically, we study how increasing under-articulated (dampened) and over-articulated (exaggerated) lip motion affects human perception of quality. We examine the impact of articulation strength on human perception when considering only lip motion, where viewers are presented with talking faces represented by landmarks, and in the context of embodied characters, where viewers are presented with photo-realistic videos. Our results show that viewers prefer over-articulated lip motion consistently more than under-articulated lip motion and that this preference generalizes across different speakers and embodiments.
Submitted 10 November, 2022; v1 submitted 18 March, 2022;
originally announced March 2022.
-
Aphasic Speech Recognition using a Mixture of Speech Intelligibility Experts
Authors:
Matthew Perez,
Zakaria Aldeneh,
Emily Mower Provost
Abstract:
Robust speech recognition is a key prerequisite for semantic feature extraction in automatic aphasic speech analysis. However, standard one-size-fits-all automatic speech recognition models perform poorly when applied to aphasic speech. One reason for this is the wide range of speech intelligibility due to different levels of severity (i.e., higher severity lends itself to less intelligible speech). To address this, we propose a novel acoustic model based on a mixture of experts (MoE), which handles the varying intelligibility stages present in aphasic speech by explicitly defining severity-based experts. At test time, the contribution of each expert is decided by estimating speech intelligibility with a speech intelligibility detector (SID). We show that our proposed approach significantly reduces phone error rates across all severity stages in aphasic speech compared to a baseline approach that does not incorporate severity information into the modeling process.
Submitted 24 August, 2020;
originally announced August 2020.
-
On the Role of Visual Cues in Audiovisual Speech Enhancement
Authors:
Zakaria Aldeneh,
Anushree Prasanna Kumar,
Barry-John Theobald,
Erik Marchi,
Sachin Kajarekar,
Devang Naik,
Ahmed Hussen Abdelaziz
Abstract:
We present an introspection of an audiovisual speech enhancement model. In particular, we focus on interpreting how a neural audiovisual speech enhancement model uses visual cues to improve the quality of the target speech signal. We show that visual cues provide not only high-level information about speech activity, i.e., speech/silence, but also fine-grained visual information about the place of articulation. One byproduct of this finding is that the learned visual embeddings can be used as features for other visual speech applications. We demonstrate the effectiveness of the learned visual embeddings for classifying visemes (the visual analogue of phonemes). Our results provide insight into important aspects of audiovisual speech enhancement and demonstrate how such models can be used for self-supervision tasks for visual speech applications.
Submitted 25 February, 2021; v1 submitted 24 April, 2020;
originally announced April 2020.
-
Identifying Mood Episodes Using Dialogue Features from Clinical Interviews
Authors:
Zakaria Aldeneh,
Mimansa Jaiswal,
Michael Picheny,
Melvin McInnis,
Emily Mower Provost
Abstract:
Bipolar disorder, a severe chronic mental illness characterized by pathological mood swings from depression to mania, requires ongoing symptom severity tracking to both guide and measure treatments that are critical for maintaining long-term health. Mental health professionals assess symptom severity through semi-structured clinical interviews. During these interviews, they observe their patients' spoken behaviors, including both what the patients say and how they say it. In this work, we move beyond acoustic and lexical information, investigating how higher-level interactive patterns also change during mood episodes. We then perform a secondary analysis, asking if these interactive patterns, measured through dialogue features, can be used in conjunction with acoustic features to automatically recognize mood episodes. Our results show that it is beneficial to consider dialogue features when analyzing and building automated systems for predicting and monitoring mood.
Submitted 24 March, 2022; v1 submitted 28 September, 2019;
originally announced October 2019.
-
Controlling for Confounders in Multimodal Emotion Classification via Adversarial Learning
Authors:
Mimansa Jaiswal,
Zakaria Aldeneh,
Emily Mower Provost
Abstract:
Various psychological factors affect how individuals express emotions. Yet, when we collect data intended for use in building emotion recognition systems, we often try to do so by creating paradigms that are designed just with a focus on eliciting emotional behavior. Algorithms trained with these types of data are unlikely to function outside of controlled environments because our emotions naturally change as a function of these other factors. In this work, we study how the multimodal expressions of emotion change when an individual is under varying levels of stress. We hypothesize that stress produces modulations that can hide the true underlying emotions of individuals and that we can make emotion recognition algorithms more generalizable by controlling for variations in stress. To this end, we use adversarial networks to decorrelate stress modulations from emotion representations. We study how stress alters acoustic and lexical emotional predictions, paying special attention to how modulations due to stress affect the transferability of learned emotion recognition models across domains. Our results show that stress is indeed encoded in trained emotion classifiers and that this encoding varies across levels of emotions and across the lexical and acoustic modalities. Our results also show that emotion recognition models that control for stress during training have better generalizability when applied to new domains, compared to models that do not control for stress during training. We conclude that it is necessary to consider the effect of extraneous psychological factors when building and testing emotion recognition models.
Submitted 23 August, 2019;
originally announced August 2019.
-
MuSE-ing on the Impact of Utterance Ordering On Crowdsourced Emotion Annotations
Authors:
Mimansa Jaiswal,
Zakaria Aldeneh,
Cristian-Paul Bara,
Yuanhang Luo,
Mihai Burzo,
Rada Mihalcea,
Emily Mower Provost
Abstract:
Emotion recognition algorithms rely on data annotated with high-quality labels. However, emotion expression and perception are inherently subjective. There is generally not a single annotation that can be unambiguously declared "correct". As a result, annotations are colored by the manner in which they were collected. In this paper, we conduct crowdsourcing experiments to investigate this impact on both the annotations themselves and on the performance of these algorithms. We focus on one critical question: the effect of context. We present a new emotion dataset, Multimodal Stressed Emotion (MuSE), and annotate the dataset using two conditions: randomized, in which annotators are presented with clips in random order, and contextualized, in which annotators are presented with clips in order. We find that contextual labeling schemes result in annotations that are more similar to a speaker's own self-reported labels and that labels generated from randomized schemes are most easily predictable by automated systems.
Submitted 27 March, 2019;
originally announced March 2019.