-
Post-Training Embedding Alignment for Decoupling Enrollment and Runtime Speaker Recognition Models
Authors:
Chenyang Gao,
Brecht Desplanques,
Chelsea J. -T. Ju,
Aman Chadha,
Andreas Stolcke
Abstract:
Automated speaker identification (SID) is a crucial step for the personalization of a wide range of speech-enabled services. Typical SID systems use a symmetric enrollment-verification framework with a single model to derive embeddings both offline for voice profiles extracted from enrollment utterances, and online from runtime utterances. Due to the distinct circumstances of enrollment and runtime, such as different computation and latency constraints, several applications would benefit from an asymmetric enrollment-verification framework that uses different models for enrollment and runtime embedding generation. To support this asymmetric SID where each of the two models can be updated independently, we propose using a lightweight neural network to map the embeddings from the two independent models to a shared speaker embedding space. Our results show that this approach significantly outperforms cosine scoring in a shared speaker logit space for models that were trained with a contrastive loss on large datasets with many speaker identities. This proposed Neural Embedding Speaker Space Alignment (NESSA) combined with an asymmetric update of only one of the models delivers at least 60% of the performance gain achieved by updating both models in the standard symmetric SID approach.
Submitted 22 January, 2024;
originally announced January 2024.
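The asymmetric scoring idea in this abstract can be illustrated with a minimal sketch. It assumes two hypothetical fixed linear maps (`W_ENROLL`, `W_RUNTIME`) standing in for the lightweight alignment network; the weights are hand-picked for illustration, not trained, and none of the names come from the paper.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical alignment maps (assumption: plain linear projections).
# The enrollment model emits 3-dim embeddings, the runtime model 2-dim ones;
# both are mapped into a shared 2-dim speaker space.
W_ENROLL = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0]]
W_RUNTIME = [[0.0, 1.0],
             [1.0, 0.0]]

def asymmetric_score(enroll_emb, runtime_emb):
    """Cosine score after mapping both embeddings into the shared space."""
    shared_enroll = matvec(W_ENROLL, enroll_emb)
    shared_runtime = matvec(W_RUNTIME, runtime_emb)
    return cosine(shared_enroll, shared_runtime)

profile = [0.6, 0.8, 0.1]      # voice profile from the enrollment model
same_speaker = [0.8, 0.6]      # runtime embedding that aligns with the profile
other_speaker = [0.6, -0.8]    # runtime embedding orthogonal to the profile

score_same = asymmetric_score(profile, same_speaker)
score_other = asymmetric_score(profile, other_speaker)
```

Because each model has its own map, either model could in principle be replaced by retraining only its map, which is the decoupling the abstract describes.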
-
Adversarial Reweighting for Speaker Verification Fairness
Authors:
Minho Jin,
Chelsea J. -T. Ju,
Zeya Chen,
Yi-Chieh Liu,
Jasha Droppo,
Andreas Stolcke
Abstract:
We address performance fairness for speaker verification using the adversarial reweighting (ARW) method. ARW is reformulated for speaker verification with metric learning, and shown to improve results across different subgroups of gender and nationality, without requiring annotation of subgroups in the training data. An adversarial network learns a weight for each training sample in the batch so that the main learner is forced to focus on poorly performing instances. Using a min-max optimization algorithm, this method improves overall speaker verification fairness. We present three different ARW formulations: accumulated pairwise similarity, pseudo-labeling, and pairwise weighting, and measure their performance in terms of equal error rate (EER) on the VoxCeleb corpus. Results show that the pairwise weighting method can achieve 1.08% overall EER, 1.25% for male and 0.67% for female speakers, with relative EER reductions of 7.7%, 10.1% and 3.0%, respectively. For nationality subgroups, the proposed algorithm showed 1.04% EER for US speakers, 0.76% for UK speakers, and 1.22% for all others. The absolute EER gap between gender groups was reduced from 0.70% to 0.58%, while the standard deviation over nationality groups decreased from 0.21 to 0.19.
Submitted 15 July, 2022;
originally announced July 2022.
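The fairness figures in this abstract are stated in terms of EER and subgroup gaps. The following is a minimal sketch of how an EER can be estimated from trial scores, assuming a simple threshold sweep; the function name and the toy scores are illustrative and do not come from the paper. The last lines recompute the reported gender gap from the two subgroup EERs quoted in the abstract.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: sweep every observed score as a threshold and return
    the mean of false-accept and false-reject rates where they are closest."""
    best_gap, best_eer = None, None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejects: same-speaker trials scored below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False accepts: different-speaker trials at or above the threshold.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer

# Toy trial scores (made up for illustration only).
toy_targets = [0.9, 0.8, 0.7, 0.4]
toy_nontargets = [0.6, 0.5, 0.3, 0.2]
toy_eer = equal_error_rate(toy_targets, toy_nontargets)

# Gender gap from the subgroup EERs quoted in the abstract (in percent).
male_eer, female_eer = 1.25, 0.67
gender_gap = round(male_eer - female_eer, 2)
```

With the quoted subgroup values, the difference reproduces the 0.58% absolute gap stated in the abstract.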
-
Fusion of Embeddings Networks for Robust Combination of Text Dependent and Independent Speaker Recognition
Authors:
Ruirui Li,
Chelsea J. -T. Ju,
Zeya Chen,
Hongda Mao,
Oguz Elibol,
Andreas Stolcke
Abstract:
By implicitly recognizing a user based on his/her speech input, speaker identification enables many downstream applications, such as personalized system behavior and expedited shopping checkouts. Based on whether the speech content is constrained or not, both text-dependent (TD) and text-independent (TI) speaker recognition models may be used. We wish to combine the advantages of both types of models through an ensemble system to make more reliable predictions. However, any such combined approach has to be robust to incomplete inputs, i.e., when either TD or TI input is missing. As a solution we propose a fusion of embeddings network (FOEnet) architecture, combining joint learning with neural attention. We compare FOEnet with four competitive baseline methods on a dataset of voice assistant inputs, and show that it achieves higher accuracy than the baseline and score fusion methods, especially in the presence of incomplete inputs.
Submitted 18 June, 2021;
originally announced June 2021.
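One way to picture attention-based fusion that tolerates a missing TD or TI input is a softmax restricted to the inputs that are present. The sketch below assumes fixed, hypothetical attention logits passed in by the caller; in a trained network these would be computed from the inputs, and nothing here reproduces the architecture from the paper.

```python
import math

def fuse_embeddings(embeddings, attention_logits):
    """Attention-weighted average of the available embeddings.

    embeddings: list of equal-length vectors, with None marking a missing input.
    attention_logits: one unnormalized attention score per input (hypothetical).
    """
    available = [(emb, logit)
                 for emb, logit in zip(embeddings, attention_logits)
                 if emb is not None]
    if not available:
        raise ValueError("at least one embedding must be present")
    # Softmax over the available inputs only, so missing ones get zero weight.
    max_logit = max(logit for _, logit in available)
    exps = [math.exp(logit - max_logit) for _, logit in available]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(available[0][0])
    return [sum(w * emb[i] for w, (emb, _) in zip(weights, available))
            for i in range(dim)]

td_embedding = [1.0, 0.0]   # toy text-dependent embedding
ti_embedding = [0.0, 1.0]   # toy text-independent embedding

fused_both = fuse_embeddings([td_embedding, ti_embedding], [0.0, 0.0])
fused_td_only = fuse_embeddings([td_embedding, None], [0.0, 0.0])
```

When one input is missing, all attention mass moves to the remaining input, so the fused vector degrades to that single embedding instead of failing.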
-
Non-local Convolutional Neural Networks (NLCNN) for Speaker Recognition
Authors:
Haici Yang,
Hongda Mao,
Ruirui Li,
Chelsea J. T. Ju,
Oguz Elibol
Abstract:
Speaker recognition is the process of identifying a speaker based on the voice. The technology has attracted more attention with the recent increase in popularity of smart voice assistants, such as Amazon Alexa. In the past few years, various convolutional neural network (CNN) based speaker recognition algorithms have been proposed and achieved satisfactory performance. However, convolutional operations are building blocks that typically operate on one local neighborhood at a time and thus fail to capture global, long-range interactions at the feature level, which are critical for understanding the pattern in a speaker's voice. In this work, we propose to apply Non-local Convolutional Neural Networks (NLCNN) to improve the capability of capturing long-range dependencies at the feature level, thereby improving speaker recognition performance. Specifically, we introduce non-local blocks where the output response of a position is computed as a weighted sum of the input features at all positions. Combining non-local blocks with pre-defined CNN networks, we investigate the effectiveness of NLCNN models. Without extensive tuning, the proposed NLCNN models outperform state-of-the-art speaker recognition algorithms on the public VoxCeleb dataset. In addition, we investigate different types of non-local operations applied to the frequency-time domain, time domain, frequency domain and frame level, respectively. Among them, the time-domain operation is the most effective for speaker recognition applications.
Submitted 19 May, 2021; v1 submitted 6 November, 2020;
originally announced November 2020.
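The non-local operation described in this abstract, where each output position is a weighted sum of the input features at all positions, can be sketched as follows. This is a simplified version that assumes softmax-normalized dot-product similarity as the weighting and omits the learned projections and residual connection a full block would normally have; the function name is illustrative.

```python
import math

def non_local_response(features):
    """For each position, return a weighted sum of the features at all
    positions, weighted by softmax-normalized dot-product similarity."""
    outputs = []
    for query in features:
        # Similarity of this position to every position (including itself).
        sims = [sum(q * k for q, k in zip(query, key)) for key in features]
        max_sim = max(sims)
        exps = [math.exp(s - max_sim) for s in sims]
        total = sum(exps)
        weights = [e / total for e in exps]
        dim = len(query)
        outputs.append([sum(w * f[d] for w, f in zip(weights, features))
                        for d in range(dim)])
    return outputs

# Toy sequence of two frame-level feature vectors.
frames = [[0.0, 0.0], [2.0, 4.0]]
responses = non_local_response(frames)
```

Applying the function along the time axis only, as in this toy sequence of frames, corresponds to the time-domain variant that the abstract reports as most effective.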