Self-Distillation Prototypes Network: Learning Robust Speaker Representations without Supervision
Abstract
Training speaker-discriminative and robust speaker verification systems without explicit speaker labels remains a persistent challenge. In this paper, we propose a new self-supervised speaker verification approach, Self-Distillation Prototypes Network (SDPN), which effectively facilitates self-supervised speaker representation learning. SDPN assigns the representation of the augmented views of an utterance to the same prototypes as the representation of the original view, thereby enabling effective knowledge transfer between the views. Because the SDPN training process lacks negative pairs, the network tends to align positive pairs very closely in the embedding space, a phenomenon known as model collapse. To alleviate this problem, we introduce a diversity regularization term on the embeddings in SDPN. Comprehensive experiments on the VoxCeleb datasets demonstrate the superiority of SDPN in self-supervised speaker verification. SDPN sets a new state of the art on the VoxCeleb1 speaker verification evaluation benchmark, achieving equal error rates of 1.80%, 1.99%, and 3.62% on the VoxCeleb1-O, VoxCeleb1-E, and VoxCeleb1-H trials, respectively, without using any speaker labels in training. Ablation studies show that both the proposed learnable prototypes in the self-distillation network and the diversity regularization contribute to the verification performance. Code is publicly available at https://github.com/modelscope/3D-Speaker.
Index Terms— Speaker verification, self-supervised learning, non-contrastive methods, model collapse, self-distillation prototypes network
1 Introduction
With the great success of deep learning, speaker verification (SV) systems have achieved remarkable progress in recent years. The availability of large-scale labeled datasets is critical to deep learning performance. However, collecting large amounts of real-world labeled SV data is laborious and expensive. Therefore, it is of great interest to explore approaches that mitigate the dependence on labeled data, such as self-supervised learning (SSL).
SSL methods learn representations of data without supervision (e.g., class labels). SSL methods can be roughly categorized into contrastive [1, 2, 3, 4, 5, 6] and non-contrastive approaches [7, 8, 9, 10, 11, 12, 13, 14], depending on whether the training process involves negative samples. Contrastive methods require large batch sizes or special techniques such as memory banks to attain high performance. More importantly, the quality of training is highly dependent on the correctness of negative pairs. For SSL-based SV, since there are no speaker labels, a common practice is to construct positive pairs from the same utterance and negative pairs from different utterances. Hence, within a batch, some negative pairs may come from the same speaker, in which case they are incorrect.
In contrast, non-contrastive SSL methods do not have this issue because negative samples are not required in the training process, yet they have shown comparable or better performance compared to contrastive counterparts [7, 8, 9, 10, 11, 12, 13, 14]. Non-contrastive SSL methods can be broadly classified into single-stage [7, 8, 9, 10, 11, 12] and multi-stage training methods [13, 14]. Differences between single-stage and multi-stage training methods are detailed in [11].
![Refer to caption](x1.png)
Note that the multi-stage training methods require an assumption of the rough number of speakers in the entire dataset and use a clustering algorithm to generate pseudo speaker labels for each utterance, which contradicts the original purpose of SSL. Therefore, in this work, we adopt the line of single-stage non-contrastive approaches for exploring self-supervised SV. Among prior works in this line, Sang et al. [7] propose a regularization structure inspired by BYOL [8]. In BYOL, an online network predicts a target network representation of the same utterance under different augmented views. Caron et al. [9] propose a self-distillation framework, DINO, that comprises a teacher and a student network. The outputs of the teacher network are used as ground truth to optimize the student network. Heo et al. [10] further introduce a curriculum learning strategy to DINO to guide model training. Due to the lack of negative pairs, these non-contrastive SSL methods are inclined to map positive pairs to the same or similar positions in the embedding space, resulting in degenerate solutions, a phenomenon known as model collapse. To alleviate model collapse, Chen et al. [11] propose two regularization terms in DINO. Chen et al. [12] enhance SV performance by applying augmentation strategies to DINO.
Although the application of DINO and BYOL for self-supervised speaker representation learning is promising [7, 12, 15], the key limitation of these non-contrastive methods lies in their emphasis on minimizing the distance between representations of the same utterance, while overlooking the relationship between representations of utterances from different speakers, which substantially limits their capability.
To tackle this limitation, we propose a novel non-contrastive Self-Distillation Prototypes Network (SDPN) for self-supervised speaker verification. In order to alleviate model collapse within the SDPN framework, we also introduce a regularization term to embeddings. Our contributions can be summarized as follows:
- We propose a novel Self-Distillation Prototypes Network (SDPN) for self-supervised speaker verification. Our first key innovation is introducing learnable prototypes in a self-distillation framework to utilize unlabeled data to learn robust speaker-discriminative embeddings. The role of the prototypes is similar to that of a classifier in fully supervised learning. SDPN explicitly minimizes the distance between different augmented views of the same utterance and, concurrently, implicitly differentiates the “class centers” within the prototypes (the basis vectors of the prototypes, which play a role similar to the class centers of a classifier in supervised training) by assigning representations of different utterances to their corresponding basis vectors in the prototypes. In this way, SDPN enhances clustering of embeddings of the same utterance while separating those from different speakers, effectively addressing the limitation of non-contrastive SSL.
- Our second key innovation is proposing a diversity regularization term and integrating it within SDPN. This component computes the pairwise similarity among embeddings and actively pushes apart the nearest embeddings to promote the diversity of speaker embeddings within a batch. Diversity regularization further enhances the robustness of speaker embeddings.
- Experiments on the VoxCeleb1 speaker verification evaluation benchmark demonstrate the superiority of our SDPN for self-supervised SV, substantially narrowing the performance gap between SSL and supervised methods for SV. SDPN sets a new state of the art (SOTA) on VoxCeleb1, achieving equal error rates of 1.80%, 1.99%, and 3.62% on the VoxCeleb1-O, VoxCeleb1-E, and VoxCeleb1-H trials, respectively, without using any speaker labels in training. Ablation studies show that both the proposed learnable prototypes in the self-distillation network and the diversity regularization contribute to the speaker verification performance.
2 Proposed Method
Fig. 1 illustrates the architecture of the proposed self-distillation framework SDPN. The upper branch in Fig. 1 presents the teacher network, while the lower branch shows the student network. Both networks share the same architecture with different parameters.
The teacher/student network comprises an encoder f for extracting speaker embeddings and a multi-layer perceptron (MLP) h for providing a non-linear transformation of these embeddings. We adopt ECAPA-TDNN [16] as the encoder. The MLP h comprises three fully connected layers with dimensions of 2048-2048-256, followed by L2 normalization. The output of each of the first two layers passes through both batch normalization [17] and a GELU activation function [18]. The learnable prototypes c are shared between the teacher and student networks, and are used to compute the soft distributions between global and local views. Global views and local views refer to long and short segments, respectively, randomly cropped from the same utterance. Global views are fed into the teacher network and local views are fed into the student network. The global information learned by the teacher network guides the training of the student network, thereby enforcing local-to-global correspondences.
Specifically, we adopt a multi-crop strategy to sample one global view $x_g$ and four local views $x_{l_1}, \dots, x_{l_4}$ from an utterance. Data augmentation has been proven to be crucial for both supervised and self-supervised representation learning. To adequately capture the utterance-dependent variability in speaker embeddings, we explore two augmentation strategies, WavAugment [19] and SpecAugment [20], on the local views in SDPN, while the global view is left unchanged. Then, $x_g$ is first encoded by the teacher’s encoder $f_{tea}$, and the resulting representations are taken as speaker embeddings. Next, the speaker embeddings are mapped through the teacher’s MLP $h_{tea}$. At the same time, the four local views are encoded by the student’s encoder $f_{stu}$ and then mapped through the student’s MLP $h_{stu}$. The parameters of the teacher encoder and MLP are updated via an Exponential Moving Average (EMA) of the parameters of the student encoder and MLP. To train the encoders, we compute distributions of similarity between the $D$-dimensional prototypes $c$ and each pair of local view and global view. The cross-entropy (CE) loss is calculated to minimize the discrepancy between the teacher and student probability distributions as follows:
$$\mathcal{L}_{CE} = \sum_{i=1}^{4} H\left(P^{tea}(x_g),\, P^{stu}(x_{l_i})\right) \tag{1}$$
$$P^{tea}(x_g) = \mathrm{SK}\left(h_{tea}(f_{tea}(x_g)) \cdot c \,/\, \tau_t\right) \tag{2}$$
$$P^{stu}(x_{l_i}) = \mathrm{Softmax}\left(h_{stu}(f_{stu}(x_{l_i})) \cdot c \,/\, \tau_s\right) \tag{3}$$
where $H(\cdot,\cdot)$ is cross-entropy. $P^{tea}$ and $P^{stu}$ denote the output probability distributions of the teacher network and the student network, and $\tau_t$ and $\tau_s$ are temperature parameters that control the sharpness of the teacher’s and student’s output distributions, respectively. $\mathrm{SK}$ denotes Sinkhorn-Knopp (SK) batch normalization [21], which helps stabilize the teacher network. Softmax is applied to the student’s output. The temperature of the teacher is lower, which yields a sharper output distribution.
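The loss described above can be sketched in plain Python. This is a minimal illustration rather than the released implementation: the function names (`sinkhorn_knopp`, `sdpn_ce_loss`), the three Sinkhorn-Knopp iterations, and the use of unnormalized prototypes are assumptions, and gradient handling and the EMA teacher update are omitted.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def prototype_scores(z, prototypes):
    """Dot product between one MLP output and every prototype vector."""
    return [sum(a * b for a, b in zip(z, c)) for c in prototypes]

def softmax(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sinkhorn_knopp(batch_logits, temperature, n_iters=3):
    """Balance a batch of teacher logits (B x K) so prototypes are used evenly.

    Returns B rows that each sum to 1. The iteration count is an assumption.
    """
    B, K = len(batch_logits), len(batch_logits[0])
    m = max(max(row) for row in batch_logits)
    q = [[math.exp((x - m) / temperature) for x in row] for row in batch_logits]
    total = sum(sum(row) for row in q)
    q = [[x / total for x in row] for row in q]
    for _ in range(n_iters):
        # Give every prototype (column) the same total mass 1/K.
        col = [sum(q[b][k] for b in range(B)) for k in range(K)]
        q = [[q[b][k] / (col[k] * K) for k in range(K)] for b in range(B)]
        # Give every sample (row) the same total mass 1/B.
        row = [sum(r) for r in q]
        q = [[q[b][k] / (row[b] * B) for k in range(K)] for b in range(B)]
    return [[x * B for x in r] for r in q]

def cross_entropy(p_teacher, p_student, eps=1e-12):
    return -sum(p * math.log(q + eps) for p, q in zip(p_teacher, p_student))

def sdpn_ce_loss(teacher_out, student_out, prototypes, tau_t=0.04, tau_s=0.1):
    """teacher_out: B global-view outputs; student_out: B lists of local-view outputs."""
    # Re-normalizing is harmless if the MLP output is already L2-normalized.
    t_logits = [prototype_scores(l2_normalize(z), prototypes) for z in teacher_out]
    p_tea = sinkhorn_knopp(t_logits, tau_t)
    loss = 0.0
    for b, local_views in enumerate(student_out):
        for z in local_views:  # sum over the local views of utterance b
            p_stu = softmax(prototype_scores(l2_normalize(z), prototypes), tau_s)
            loss += cross_entropy(p_tea[b], p_stu)
    return loss / len(teacher_out)  # average over the batch
```

With two orthogonal prototypes, a student whose local views point to the same prototype as the teacher's global view receives a much smaller loss than one whose views point to the other prototype, which is the local-to-global agreement the training objective rewards.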
Architecture | Params (M) | VoxCeleb1-O EER (%) | VoxCeleb1-O MinDCF | VoxCeleb1-E EER (%) | VoxCeleb1-E MinDCF | VoxCeleb1-H EER (%) | VoxCeleb1-H MinDCF |
---|---|---|---|---|---|---|---|
DINO* | 90.68 | 2.65 | 0.202 | 2.74 | 0.188 | 5.02 | 0.304 |
SDPN | 57.24 | 2.13 | 0.170 | 2.12 | 0.150 | 4.15 | 0.263 |
SDPN w/o PRO | 56.98 | 3.76 | 0.278 | 3.93 | 0.271 | 7.28 | 0.436 |
SDPN w/o PRO-2 | 57.29 | 3.75 | 0.274 | 3.90 | 0.270 | 7.22 | 0.433 |
SDPN w/ DR | 57.24 | 1.80 | 0.139 | 1.99 | 0.131 | 3.62 | 0.219 |
We further design a diversity regularization term to encourage the diversity of the embeddings within a batch. This regularizer forces the embeddings of utterances to be different and hence prevents trivial solutions, i.e., alleviating model collapse. First, we compute the pairwise similarity of speaker embeddings within a batch. Next, all the closest embeddings are separated in order to reinforce their dissimilarity. The diversity regularization loss is calculated as follows:
(4) |
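Diversity regularization implicitly considers the relationship between different categories and further complements SDPN. The overall loss is a combination of the CE loss $\mathcal{L}_{CE}$ and the diversity regularization loss $\mathcal{L}_{DR}$, weighted by the hyperparameter $\lambda$:
$$\mathcal{L} = \mathcal{L}_{CE} + \lambda\, \mathcal{L}_{DR} \tag{5}$$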
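To illustrate the two steps described above (pairwise comparison within a batch, then pushing the closest embeddings apart), the following sketch uses a nearest-neighbour log-distance penalty. This specific form is an assumption chosen for illustration and is not claimed to match the regularizer used in SDPN term by term; the function names and the `eps` constant are likewise hypothetical. The weighted sum at the end follows the description of the overall loss.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def diversity_regularization(embeddings, eps=1e-8):
    """Penalize each embedding for being close to its nearest neighbour in the batch.

    Assumed form: mean over the batch of -log(distance to nearest neighbour),
    computed on L2-normalized embeddings. Requires at least two embeddings.
    """
    embs = [l2_normalize(e) for e in embeddings]
    n = len(embs)
    total = 0.0
    for i in range(n):
        nearest = min(math.dist(embs[i], embs[j]) for j in range(n) if j != i)
        total += -math.log(nearest + eps)
    return total / n

def total_loss(ce_loss, dr_loss, lam=0.1):
    """Overall loss: CE loss plus the regularizer weighted by lam."""
    return ce_loss + lam * dr_loss
```

Under this assumed form, a batch whose embeddings are nearly identical receives a large penalty, while a well-spread batch receives a small (possibly negative) one, so minimizing the term discourages collapse.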
3 Experiments
3.1 Experimental Setup
Datasets and Evaluation Metrics We evaluate the efficacy of the proposed approach on the commonly used VoxCeleb datasets. Specifically, we adopt the development portion of VoxCeleb2 [22] for training, which comprises 1,092,009 utterances across 5,994 speakers. No speaker labels are used in training in any experiment. We report the experimental results on three trial lists of VoxCeleb1 [23], in terms of two metrics, namely, the equal error rate (EER) and the minimum of the normalized detection cost function (MinDCF) with the settings of $P_{target}$ = 0.05 and $C_{fa}$ = $C_{miss}$ = 1.
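For readers less familiar with the two metrics, the sketch below computes EER and the normalized minimum DCF from a list of trial scores and target/non-target labels. It is a simplified illustration, not the scoring tool used for the reported numbers: it sweeps thresholds over the sorted scores, treats tied scores one at a time, and takes the EER at the threshold where the two error rates are closest. The default cost parameters (target prior 0.05, unit costs) are stated here as the conventional setting and are assumptions of the sketch.

```python
def error_rates(scores, labels):
    """Return (FRR, FAR) for every threshold; a trial is accepted if score >= threshold.

    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    """
    n_tar = sum(labels)
    n_non = len(labels) - n_tar
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    rates = [(1.0, 0.0)]  # threshold above every score: all trials rejected
    accepted_tar = accepted_non = 0
    for _, label in ranked:
        accepted_tar += label
        accepted_non += 1 - label
        rates.append((1.0 - accepted_tar / n_tar, accepted_non / n_non))
    return rates

def equal_error_rate(scores, labels):
    """EER approximated at the threshold where FRR and FAR are closest."""
    frr, far = min(error_rates(scores, labels), key=lambda r: abs(r[0] - r[1]))
    return (frr + far) / 2.0

def min_dcf(scores, labels, p_target=0.05, c_miss=1.0, c_fa=1.0):
    """Minimum detection cost, normalized by the cost of the best trivial system."""
    norm = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return min(
        (c_miss * frr * p_target + c_fa * far * (1.0 - p_target)) / norm
        for frr, far in error_rates(scores, labels)
    )
```

Because the target prior is small, false acceptances dominate the detection cost, which is why MinDCF and EER can rank systems differently.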
Input Features For each utterance, we use the multi-crop strategy (Section 2) for SDPN training, in which 4 s segments and 2 s segments are taken as global views and local views, respectively. The acoustic features used in the experiments are 80-dimensional Filter Bank (FBank) features with a 25 ms window and a 10 ms shift. Speech Activity Detection (SAD) is not performed since the training data mostly consists of continuous speech.
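The cropping step can be sketched as follows, assuming the crops are taken on a frame sequence with a 10 ms shift (so a 4 s view is about 400 frames and a 2 s view about 200 frames). The helper names and the handling of utterances shorter than a crop are assumptions made for illustration.

```python
import random

FRAME_SHIFT_S = 0.010  # 10 ms frame shift

def random_crop(features, crop_seconds, rng):
    """Take a random contiguous crop of the requested duration."""
    n_frames = int(round(crop_seconds / FRAME_SHIFT_S))
    if len(features) <= n_frames:
        return list(features)  # assumption: short utterances are kept whole
    start = rng.randrange(len(features) - n_frames + 1)
    return features[start:start + n_frames]

def multi_crop(features, rng, n_local=4, global_s=4.0, local_s=2.0):
    """Sample one long (global) view and several short (local) views."""
    global_view = random_crop(features, global_s, rng)
    local_views = [random_crop(features, local_s, rng) for _ in range(n_local)]
    return global_view, local_views

# Usage: a dummy 10 s utterance of 80-dimensional frames.
utterance = [[float(t)] * 80 for t in range(1000)]
global_view, local_views = multi_crop(utterance, random.Random(0))
```

Each call draws the crop positions independently, so the local views may or may not overlap with the global view.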
Data Augmentation We explore WavAugment and SpecAugment in SDPN. For WavAugment, additive noise from the MUSAN corpus [24] with an SNR between 0 and 15 dB and reverberation from Room Impulse Responses (RIR) [25] are randomly applied to each local view. For SpecAugment, one time mask and one frequency mask are randomly applied to the FBank features of the local views. The time masking length is 0 to 10 frames and the frequency masking length is 0 to 6 dimensions.
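The masking part of this recipe is simple enough to sketch directly. The function below applies one time mask of 0 to 10 frames and one frequency mask of 0 to 6 bins to a feature matrix; filling masked cells with zero is an assumption, as is the function name. The noise and reverberation augmentation is not shown because it depends on external corpora.

```python
import random

def spec_augment(fbank, rng, max_time_mask=10, max_freq_mask=6):
    """Return a copy of fbank (frames x bins) with one time and one frequency mask."""
    n_frames, n_bins = len(fbank), len(fbank[0])
    out = [list(frame) for frame in fbank]  # leave the input untouched
    t_len = rng.randint(0, min(max_time_mask, n_frames))
    t_start = rng.randint(0, n_frames - t_len)
    f_len = rng.randint(0, min(max_freq_mask, n_bins))
    f_start = rng.randint(0, n_bins - f_len)
    for t in range(t_start, t_start + t_len):
        out[t] = [0.0] * n_bins  # time mask: blank out whole frames
    for frame in out:
        for f in range(f_start, f_start + f_len):
            frame[f] = 0.0  # frequency mask: blank out a band in every frame
    return out
```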
Implementation Details In our experiments, we use ECAPA-TDNN with attentive statistical pooling as the encoder. The weight decay is fixed at 5e-5. The learning rate schedule starts with 10 warm-up epochs with a linear increase from 0 to 0.4, followed by a cosine decay to a final learning rate of 1e-5. We train the model for 150 epochs using the stochastic gradient descent optimizer [26] with momentum 0.9, on 4 NVIDIA V100 GPUs. The teacher and student temperatures, $\tau_t$ and $\tau_s$, are set to 0.04 and 0.1, respectively. The diversity regularization weight $\lambda$ is set to 0.1. The dimension of each prototype is set to 256. The dimension of speaker embeddings is 512.
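The learning rate schedule described above can be written as a small function. This sketch evaluates the schedule per epoch (fractional epochs are also accepted); whether the actual training updates the rate per epoch or per iteration is not stated, so that granularity is an assumption.

```python
import math

def learning_rate(epoch, total_epochs=150, warmup_epochs=10,
                  peak_lr=0.4, final_lr=1e-5):
    """Linear warm-up from 0 to peak_lr, then cosine decay to final_lr."""
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))

# Usage: inspect the rate at a few points of the 150-epoch run.
schedule = [learning_rate(e) for e in (0, 5, 10, 80, 150)]
```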
Model | Embedding Extractor | EER(%) |
---|---|---|
AP+AAT [27] | Fast ResNet34 | 8.65 |
MoCo + WavAug [1] | TDNN | 8.23 |
SSReg [7] | Fast ResNet34 | 6.99 |
Mixup-Aug [6] | Fast ResNet34 | 5.84 |
DINO + CL [10] | ECAPA-TDNN | 4.47 |
DINO [15] | ECAPA-TDNN | 3.30 |
RDINO [11] | ECAPA-TDNN | 3.24 |
DINO-Aug [12] | ECAPA-TDNN | 2.51 |
C3-DINO [15] | ECAPA-TDNN | 2.50 |
SDPN w/ DR | ECAPA-TDNN | 1.80 |
![Refer to caption](extracted/5694513/fig2.png)
3.2 Results and Analysis
DINO is currently the most prevalent SSL framework for SV, and the majority of competitive models, including the SOTA C3-DINO [15], are based on DINO (as shown in Table 2); hence Table 1 compares the results from our SDPN and our reproduced baseline DINO on the three VoxCeleb1-{O,E,H} test sets. (Our reproduced DINO achieves a better EER of 2.65% on VoxCeleb1-O, whereas the original DINO paper [15] reported 3.30% EER on this test set.) Comparing rows 1 and 2 shows that SDPN outperforms DINO substantially and consistently across the test sets, achieving EERs of 2.13%, 2.12%, and 4.15% with a much smaller model (57.24M parameters, only 63% of the 90.68M of DINO), which demonstrates the superiority of SDPN in self-supervised SV. Comparing rows 2 and 3 shows that removing the prototypes causes a notable degradation in verification performance. We hypothesize that the learnable prototypes implicitly capture the relationship between different classes without relying on labeled data. Since the prototypes have 0.26M parameters, to understand how much of the degradation from row 2 to row 3 is due to the smaller model size, we evaluate a counterpart, SDPN w/o PRO-2, which compensates for the reduction in model size from removing the prototypes by increasing the number of parameters in the MLP layers by 0.31M, resulting in a model size comparable to SDPN (57.29M vs. 57.24M). We find that SDPN w/o PRO-2 obtains 3.75% EER on VoxCeleb1-O, on par with the 3.76% EER from SDPN w/o PRO, yet worse than the 2.13% EER from SDPN. These results demonstrate that the learnable prototypes, rather than the additional parameters, contribute substantially to the verification performance.
Comparing rows 2 and 5 shows that adding diversity regularization (SDPN w/ DR) yields relative improvements in EERs on the three test sets by 15.49%, 6.13%, and 12.77%. Diversity regularization also consistently improves MinDCF. We employ the t-distributed Stochastic Neighbor Embedding (t-SNE) [28] to compare the disentanglement performance of speaker embeddings derived from both DINO and SDPN w/ DR, as illustrated in Fig. 2. It is clear that the embeddings extracted via SDPN w/ DR exhibit superior clustering capabilities compared to those from DINO, suggesting that SDPN makes speaker embeddings more discriminative.
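The relative improvements quoted above follow directly from the EERs in Table 1; the short check below reproduces them, together with the model-size ratio between SDPN and DINO. The dictionary and function names are chosen for this example only.

```python
def relative_improvement(baseline, improved):
    """Relative reduction of an error rate, in percent."""
    return 100.0 * (baseline - improved) / baseline

# EER (%) from Table 1, rows "SDPN" and "SDPN w/ DR".
sdpn = {"VoxCeleb1-O": 2.13, "VoxCeleb1-E": 2.12, "VoxCeleb1-H": 4.15}
sdpn_dr = {"VoxCeleb1-O": 1.80, "VoxCeleb1-E": 1.99, "VoxCeleb1-H": 3.62}

gains = {k: round(relative_improvement(sdpn[k], sdpn_dr[k]), 2) for k in sdpn}
# gains: 15.49 (O), 6.13 (E), 12.77 (H)

# Parameter counts (M) from Table 1: SDPN vs. reproduced DINO.
size_ratio = round(100.0 * 57.24 / 90.68)  # about 63 percent
```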
Weight | VoxCeleb1-O EER | VoxCeleb1-O MinDCF | VoxCeleb1-E EER | VoxCeleb1-E MinDCF | VoxCeleb1-H EER | VoxCeleb1-H MinDCF |
---|---|---|---|---|---|---|
0 | 2.13% | 0.170 | 2.12% | 0.150 | 4.15% | 0.263 |
0.05 | 1.94% | 0.134 | 2.02% | 0.132 | 3.67% | 0.227 |
0.1 | 1.80% | 0.139 | 1.99% | 0.131 | 3.62% | 0.219 |
0.2 | 1.95% | 0.157 | 2.08% | 0.136 | 3.74% | 0.223 |
Augmentation | VoxCeleb1-O EER | VoxCeleb1-O MinDCF | VoxCeleb1-E EER | VoxCeleb1-E MinDCF | VoxCeleb1-H EER | VoxCeleb1-H MinDCF |
---|---|---|---|---|---|---|
No Aug. | 3.62% | 0.268 | 4.34% | 0.266 | 6.99% | 0.407 |
WavAug. | 1.88% | 0.153 | 2.06% | 0.139 | 3.72% | 0.228 |
SpecAug. | 4.27% | 0.317 | 5.03% | 0.309 | 7.69% | 0.447 |
Wav+Spec. | 1.80% | 0.139 | 1.99% | 0.131 | 3.62% | 0.219 |
We compare our SDPN to recently proposed SSL approaches, both contrastive and non-contrastive, including [27, 6, 1, 7, 10, 11, 12], and to the current SSL SOTA C3-DINO [15], which integrates contrastive and non-contrastive methods, as shown in Table 2. On the VoxCeleb1-O test set, our non-contrastive SDPN with diversity regularization achieves 1.80% EER with a simple cosine distance scoring method, outperforming the SOTA C3-DINO (2.50% EER) by a relative 28.0%.
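Cosine scoring of a verification trial amounts to comparing the angle between the two speaker embeddings with a threshold. A minimal sketch follows; the function names and the threshold argument are illustrative, and no score normalization or calibration is applied.

```python
import math

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings."""
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    return dot / (norm_a * norm_b)

def verify(emb_a, emb_b, threshold):
    """Accept the trial as same-speaker if the score reaches the threshold."""
    return cosine_score(emb_a, emb_b) >= threshold
```

The threshold is chosen on held-out trials; sweeping it over all scores traces the error trade-off from which EER and MinDCF are computed.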
We analyze the impact of the weight $\lambda$ of diversity regularization on the performance, as shown in Table 3. We observe that applying diversity regularization improves performance even with a small weight of $\lambda = 0.05$ (1.94% vs. 2.13% EER on VoxCeleb1-O). SDPN w/ DR achieves its lowest EER, 1.80% on VoxCeleb1-O, with $\lambda = 0.1$.
Additionally, we study the effect of different data augmentation strategies on the training data, as shown in Table 4. We find that WavAugment notably improves SDPN performance; when no augmentation is applied in training, it is difficult for the entire network to converge due to the inherent property of a non-contrastive framework. Without WavAugment, SpecAugment degrades the performance since it applies an erasing operation to the acoustic features. Yet combining WavAugment and SpecAugment outperforms WavAugment alone, suggesting that these augmentation strategies are complementary.
4 Conclusion
We propose a novel self-distillation prototypes network (SDPN) with diversity regularization (DR) for self-supervised learning for speaker verification (SV), utilizing unlabeled data to learn robust speaker-discriminative embeddings. SDPN enhances clustering of embeddings from the same utterance while separating those from different speakers, addressing limitations of traditional non-contrastive methods. Diversity regularization alleviates the model collapse problem in non-contrastive frameworks. Comprehensive experiments demonstrate the superiority of SDPN, which further narrows the performance gap between SSL and supervised methods for speaker verification.
References
- [1] Wei Xia, Chunlei Zhang, Chao Weng, Meng Yu, and Dong Yu, “Self-supervised text-independent speaker verification using prototypical momentum contrastive learning,” in ICASSP 2021. 2021, pp. 6723–6727, IEEE.
- [2] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton, “A simple framework for contrastive learning of visual representations,” in ICML 2020. 2020, vol. 119 of Proceedings of Machine Learning Research, pp. 1597–1607, PMLR.
- [3] Aäron van den Oord, Yazhe Li, and Oriol Vinyals, “Representation learning with contrastive predictive coding,” CoRR, vol. abs/1807.03748, 2018.
- [4] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross B. Girshick, “Momentum contrast for unsupervised visual representation learning,” in CVPR 2020. 2020, pp. 9726–9735, Computer Vision Foundation / IEEE.
- [5] Haoran Zhang, Yuexian Zou, and Helin Wang, “Contrastive self-supervised learning for text-independent speaker verification,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2021, Toronto, ON, Canada, June 6-11, 2021. 2021, pp. 6713–6717, IEEE.
- [6] Xin Zhang, Minho Jin, Roger Cheng, Ruirui Li, Eunjung Han, and Andreas Stolcke, “Contrastive-mixup learning for improved speaker verification,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23-27 May 2022. 2022, pp. 7652–7656, IEEE.
- [7] Mufan Sang, Haoqi Li, Fang Liu, Andrew O. Arnold, and Li Wan, “Self-supervised speaker verification with simple siamese network and self-supervised regularization,” in ICASSP 2022. 2022, pp. 6127–6131, IEEE.
- [8] Jean-Bastien Grill et al., “Bootstrap your own latent - A new approach to self-supervised learning,” in NeurIPS 2020, 2020.
- [9] Mathilde Caron et al., “Emerging properties in self-supervised vision transformers,” in ICCV 2021. 2021, pp. 9630–9640, IEEE.
- [10] Hee-Soo Heo et al., “Self-supervised curriculum learning for speaker verification,” arXiv preprint arXiv:2203.14525, 2022.
- [11] Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, and Qian Chen, “Pushing the limits of self-supervised speaker verification using regularized distillation framework,” in ICASSP 2023. IEEE, 2023, pp. 1–5.
- [12] Zhengyang Chen, Yao Qian, Bing Han, Yanmin Qian, and Michael Zeng, “A comprehensive study on self-supervised distillation for speaker representation learning,” in SLT 2022. 2022, pp. 599–604, IEEE.
- [13] Bing Han, Zhengyang Chen, and Yanmin Qian, “Self-supervised learning with cluster-aware-dino for high-performance robust speaker verification,” CoRR, vol. abs/2304.05754, 2023.
- [14] Ruijie Tao, Kong Aik Lee, Rohan Kumar Das, Ville Hautamäki, and Haizhou Li, “Self-supervised speaker recognition with loss-gated learning,” in ICASSP 2022. 2022, pp. 6142–6146, IEEE.
- [15] Chunlei Zhang and Dong Yu, “C3-DINO: joint contrastive and non-contrastive self-supervised learning for speaker verification,” IEEE J. Sel. Top. Signal Process., vol. 16, no. 6, pp. 1273–1283, 2022.
- [16] Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck, “ECAPA-TDNN: emphasized channel attention, propagation and aggregation in TDNN based speaker verification,” in Interspeech 2020. 2020, pp. 3830–3834, ISCA.
- [17] Sergey Ioffe and Christian Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, Francis R. Bach and David M. Blei, Eds. 2015, vol. 37 of JMLR Workshop and Conference Proceedings, pp. 448–456, JMLR.org.
- [18] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter, “Self-normalizing neural networks,” in NIPS 2017, 2017, pp. 971–980.
- [19] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in ICASSP 2018. 2018, pp. 5329–5333, IEEE.
- [20] Daniel S. Park et al., “Specaugment: A simple data augmentation method for automatic speech recognition,” in Interspeech 2019. 2019, pp. 2613–2617, ISCA.
- [21] Marco Cuturi, “Sinkhorn distances: Lightspeed computation of optimal transport,” in NIPS 2013, Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger, Eds., 2013, pp. 2292–2300.
- [22] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman, “Voxceleb2: Deep speaker recognition,” in Interspeech. 2018, pp. 1086–1090, ISCA.
- [23] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, “Voxceleb: A large-scale speaker identification dataset,” in Interspeech 2017. 2017, pp. 2616–2620, ISCA.
- [24] David Snyder, Guoguo Chen, and Daniel Povey, “MUSAN: A music, speech, and noise corpus,” CoRR, vol. abs/1510.08484, 2015.
- [25] Tom Ko et al., “A study on data augmentation of reverberant speech for robust speech recognition,” in ICASSP 2017. 2017, pp. 5220–5224, IEEE.
- [26] Shun-ichi Amari, “Backpropagation and stochastic gradient descent method,” Neurocomputing, vol. 5, no. 3, pp. 185–196, 1993.
- [27] Jingu Kang, Jaesung Huh, Hee Soo Heo, and Joon Son Chung, “Augmentation adversarial training for self-supervised speaker representation learning,” IEEE J. Sel. Top. Signal Process., vol. 16, no. 6, pp. 1253–1262, 2022.
- [28] Laurens van der Maaten and Geoffrey Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 11, 2008.