Emphasized Non-Target Speaker Knowledge in Knowledge Distillation
for Automatic Speaker Verification
Abstract
Knowledge distillation (KD) is used to enhance automatic speaker verification performance by ensuring consistency between large teacher networks and lightweight student networks at the embedding level or label level. However, conventional label-level KD overlooks the significant knowledge from non-target speakers, particularly their classification probabilities, which can be crucial for automatic speaker verification. In this paper, we first demonstrate that leveraging a larger number of non-target training speakers improves the performance of automatic speaker verification models. Inspired by this finding about the importance of non-target speakers’ knowledge, we modify conventional label-level KD by disentangling and emphasizing the classification probabilities of non-target speakers during knowledge distillation. The proposed method is applied to three different student model architectures and achieves an average of 13.67% improvement in EER on the VoxCeleb dataset compared to embedding-level and conventional label-level KD methods. Code and models are available at github.com/ductuantruong/enskd.
Index Terms— knowledge distillation, label-level knowledge distillation, automatic speaker verification
1 Introduction
Automatic speaker verification (ASV) is the process of authenticating an individual’s claimed identity based on voice characteristics. By leveraging large-scale neural networks trained on abundant unlabelled speech data, self-supervised learning (SSL) models have revolutionized various speech processing tasks [1, 2, 3], including ASV [4, 5]. However, these models are computationally expensive. To better utilize SSL models, knowledge distillation (KD) can be employed to transfer their robust speech representations to smaller student models. In ASV, KD encompasses two common approaches. The first is the embedding-level method [6, 7, 8, 9], which makes student models mimic the teacher’s intermediate feature embeddings by reducing the distance between representation spaces. The second is the label-level method [6, 10], which minimizes the Kullback–Leibler divergence between the output probabilities of the teacher and student networks.
In the training step of an ASV model, the objective is to classify input speech as the target speaker (the ground-truth speaker) and avoid assigning it to non-target speakers (incorrect speakers). While the importance of the target speaker is evident, non-target speakers can also enhance the model’s discriminability, since numerous non-target speakers may share similar voice characteristics with the target speaker. In past studies, [11] compared the performance of ASV models trained on two training sets with different numbers of speakers but the same number of utterances, and found that a larger number of speakers improved performance. Similarly, in face recognition, [12] observed that increasing the number of non-target training classes improved model performance within a fixed-size training set. Building on these observations, we hypothesize that integrating knowledge from non-target speakers can enhance ASV model performance. However, because conventional label-level KD considers correlations among the teacher’s output probabilities of all speakers, the importance of non-target speakers’ probabilities can be overshadowed by the target speaker, which receives high classification confidence in the teacher model. Based on this hypothesis, the conventional label-level KD approach for ASV can be improved by emphasizing the knowledge of non-target speakers.
To validate the assumptions above, this paper first presents an experiment illustrating the importance of non-target speakers in ASV. When the number of training utterances remains the same, we observe that an increasing number of non-target training speakers leads to better results. Based on this observation, we investigate the significance of non-target speakers in conventional label-level KD for ASV models. Following Decoupled Knowledge Distillation (DKD) [13], we separate the output classification probabilities of the teacher and student models into the probabilities of the target speaker and of the non-target speakers. Subsequently, the probabilities of non-target speakers are emphasized during KD using a specific weight. We utilize the large-scale SSL model WavLM-TDNN [14] as our teacher model and employ three different network architectures as student models: x-vector [15], ResNet34 [16], and CAM++ [17]. Our experiments show that DKD with an emphasis on the non-target speakers’ output probabilities outperforms both embedding-level and conventional label-level KD methods across student models.
2 Methodology
2.1 The impact of non-target speakers for ASV
![Figure 1](x1.png)
To validate the hypothesis that a larger set of non-target speakers benefits ASV models, we conducted a toy experiment. We trained the x-vector model on a fixed budget of 100,000 training utterances from the VoxCeleb 2 dev set [18]. These utterances are evenly distributed across the training speakers, so increasing the number of speakers leaves fewer training utterances per speaker. As depicted in Figure 1, the performance of the x-vector model consistently improves with an increasing number of speakers in the training set. This indicates that involving more non-target speakers enhances the model’s ability to distinguish the target speaker from others. Inspired by this finding, we further extract and emphasize non-target speaker knowledge during the knowledge distillation process.
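The data selection behind this toy experiment can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' released code: the function name `sample_fixed_budget`, the `utt2spk` mapping, and the synthetic speaker and utterance IDs are all hypothetical. It only shows how a fixed utterance budget can be spread evenly over a varying number of speakers.

```python
import random
from collections import defaultdict

def sample_fixed_budget(utt2spk, num_speakers, total_utts=100_000, seed=0):
    """Pick `num_speakers` speakers and an equal share of utterances from each.

    `utt2spk` maps an utterance ID to its speaker ID (hypothetical format).
    """
    rng = random.Random(seed)
    spk2utts = defaultdict(list)
    for utt, spk in utt2spk.items():
        spk2utts[spk].append(utt)

    per_speaker = total_utts // num_speakers
    # Only speakers with enough utterances can supply an equal share.
    eligible = sorted(s for s, utts in spk2utts.items() if len(utts) >= per_speaker)
    if len(eligible) < num_speakers:
        raise ValueError("not enough speakers with the required utterance count")

    subset = {}
    for spk in rng.sample(eligible, num_speakers):
        for utt in rng.sample(spk2utts[spk], per_speaker):
            subset[utt] = spk
    return subset

# Toy demonstration on synthetic IDs (not VoxCeleb data): 20 speakers x 50 utterances.
toy = {f"spk{s}_utt{u}": f"spk{s}" for s in range(20) for u in range(50)}
few = sample_fixed_budget(toy, num_speakers=4, total_utts=200)
many = sample_fixed_budget(toy, num_speakers=20, total_utts=200)
print(len(few), len(set(few.values())))
print(len(many), len(set(many.values())))
```

Both subsets contain the same number of utterances; only the number of speakers, and therefore the number of non-target classes seen for each training utterance, changes.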
![Figure 2](x2.png)
2.2 Rethinking conventional label-level KD
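Following the reformulation of conventional label-level KD for computer vision tasks in [13], we interpret the conventional label-level KD loss for automatic speaker verification. In the training phase, the ASV model outputs classification probabilities over the set of $K$ training speakers, in which the probability $p_i$ of the $i$-th speaker is computed by applying the softmax function to the logits vector $z = [z_1, \dots, z_K]$:

$$p_i = \frac{\exp(z_i)}{\sum_{j=1}^{K} \exp(z_j)} \tag{1}$$

In conventional label-level KD, the student model tries to mimic the teacher model by minimizing the Kullback–Leibler (KL) divergence between the student ($p^{S}$) and teacher ($p^{T}$) output probability distributions. The loss is defined as:

$$\mathcal{L}_{KD} = \mathrm{KL}\left(p^{T} \,\|\, p^{S}\right) = \sum_{i=1}^{K} p_i^{T} \log\frac{p_i^{T}}{p_i^{S}} \tag{2}$$

where $p^{T}$ and $p^{S}$ denote the output probabilities of the teacher and student networks, respectively. We further split the set of indexes $\{1, \dots, K\}$ into the target speaker $t$ and the set of non-target speakers $\{i : i \neq t\}$:

$$\mathcal{L}_{KD} = p_t^{T} \log\frac{p_t^{T}}{p_t^{S}} + \sum_{i \neq t} p_i^{T} \log\frac{p_i^{T}}{p_i^{S}} \tag{3}$$

We define the probability of classifying a speaker as belonging to the set of non-target speakers as $p_{\setminus t}$, and the probability of predicting a specific non-target speaker $i$ among all non-target speakers as $\hat{p}_i$:

$$p_{\setminus t} = \frac{\sum_{k \neq t} \exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}, \qquad \hat{p}_i = \frac{\exp(z_i)}{\sum_{k \neq t} \exp(z_k)} \tag{4}$$

From (1) and (4), $p_i = p_{\setminus t}\,\hat{p}_i$ for every $i \neq t$, which we substitute into (3):

$$\mathcal{L}_{KD} = p_t^{T} \log\frac{p_t^{T}}{p_t^{S}} + \sum_{i \neq t} p_{\setminus t}^{T}\,\hat{p}_i^{T} \left( \log\frac{\hat{p}_i^{T}}{\hat{p}_i^{S}} + \log\frac{p_{\setminus t}^{T}}{p_{\setminus t}^{S}} \right) \tag{5}$$

Since $p_{\setminus t}^{T}$ and $p_{\setminus t}^{S}$ are independent of the class index $i$ and $\sum_{i \neq t} \hat{p}_i^{T} = 1$, we can simplify (5) to:

$$\mathcal{L}_{KD} = p_t^{T} \log\frac{p_t^{T}}{p_t^{S}} + p_{\setminus t}^{T} \log\frac{p_{\setminus t}^{T}}{p_{\setminus t}^{S}} + p_{\setminus t}^{T} \sum_{i \neq t} \hat{p}_i^{T} \log\frac{\hat{p}_i^{T}}{\hat{p}_i^{S}} \tag{6}$$

From (6), conventional label-level KD can be reformulated as the sum of two terms: 1) the Target Speaker Knowledge Distillation (TSKD) loss $\mathcal{L}_{TSKD}$, the KL divergence over the binary classification probabilities of the target speaker and all non-target speakers, and 2) the Non-Target Speaker Knowledge Distillation (NSKD) loss $\mathcal{L}_{NSKD}$, the KL divergence of the multi-class classification probabilities among non-target speakers, as shown in (7) and Fig. 2. Note that $p_{\setminus t}^{T} = 1 - p_t^{T}$.

$$\begin{aligned} \mathcal{L}_{TSKD} &= p_t^{T} \log\frac{p_t^{T}}{p_t^{S}} + p_{\setminus t}^{T} \log\frac{p_{\setminus t}^{T}}{p_{\setminus t}^{S}}, \\ \mathcal{L}_{NSKD} &= \sum_{i \neq t} \hat{p}_i^{T} \log\frac{\hat{p}_i^{T}}{\hat{p}_i^{S}}, \\ \mathcal{L}_{KD} &= \mathcal{L}_{TSKD} + \left(1 - p_t^{T}\right) \mathcal{L}_{NSKD} \end{aligned} \tag{7}$$

From the above equation, when the teacher model predicts the target speaker accurately, a large value of $p_t^{T}$ results in a smaller $(1 - p_t^{T})$, which leads to the suppression of $\mathcal{L}_{NSKD}$. This could potentially hinder the distillation of knowledge from non-target speakers in the label-level KD method.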
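The decomposition in (7) can be sanity-checked numerically. The sketch below is an illustration, not the released implementation: the function names and the toy logits are made up. It computes the full KL divergence between teacher and student posteriors, the binary target-versus-rest KL (TSKD), and the KL among renormalised non-target posteriors (NSKD), then compares the first against TSKD plus NSKD weighted by the teacher's total non-target probability.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def decompose_kd(teacher_logits, student_logits, target):
    """Return (KD, TSKD, NSKD, teacher non-target probability) for one utterance."""
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    kd = kl_divergence(p_teacher, p_student)

    # TSKD: binary distribution [target speaker, all non-target speakers].
    bin_teacher = [p_teacher[target], 1.0 - p_teacher[target]]
    bin_student = [p_student[target], 1.0 - p_student[target]]
    tskd = kl_divergence(bin_teacher, bin_student)

    # NSKD: posteriors renormalised over non-target speakers only, which is
    # the same as a softmax over the logits with the target entry removed.
    nt_teacher = softmax([z for i, z in enumerate(teacher_logits) if i != target])
    nt_student = softmax([z for i, z in enumerate(student_logits) if i != target])
    nskd = kl_divergence(nt_teacher, nt_student)

    return kd, tskd, nskd, bin_teacher[1]

# Toy logits over four speakers; speaker 0 is the target and the teacher is confident.
teacher_logits = [8.0, 2.0, 1.0, 0.5]
student_logits = [3.0, 1.5, 2.0, 0.1]
kd, tskd, nskd, weight = decompose_kd(teacher_logits, student_logits, target=0)
print(f"KD={kd:.6f}  TSKD + w*NSKD={tskd + weight * nskd:.6f}  w={weight:.6f}")
```

With a confident teacher the weight on the NSKD term is close to zero, so differences between teacher and student on the non-target speakers contribute very little to the conventional loss.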
2.3 Decoupled Knowledge Distillation with an emphasis on non-target speaker knowledge
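In Section 2.1, it was demonstrated that leveraging more non-target speaker knowledge can enhance the performance of ASV models. In other words, the NSKD loss $\mathcal{L}_{NSKD}$ may play a crucial role in the knowledge transfer from the teacher to the student model. Decoupled Knowledge Distillation (DKD) [13] proposed removing the dependency factor $(1 - p_t^{T})$ in (7), where $p_t^{T}$ is the teacher’s probability for the target speaker, by introducing hyperparameters $\alpha$ and $\beta$ to balance the TSKD loss $\mathcal{L}_{TSKD}$ and $\mathcal{L}_{NSKD}$. However, to place a greater emphasis on $\mathcal{L}_{NSKD}$, we adjust the original DKD method by simply replacing $(1 - p_t^{T})$ with the hyperparameter $\beta$:

$$\mathcal{L}_{DKD} = \mathcal{L}_{TSKD} + \beta\,\mathcal{L}_{NSKD} \tag{8}$$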
Finally, the DKD loss is combined with the classification loss to optimize the student model. Fig. 2 illustrates the comparison between DKD with an emphasis on non-target speaker knowledge, embedding-level, and conventional label-level knowledge distillation.
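A per-utterance version of the emphasized DKD loss might look like the sketch below. It is written in plain Python for readability and is an assumption-laden illustration rather than the authors' implementation: the function names, the returned dictionary, and the toy logits are hypothetical, any distillation temperature is omitted, and the classification loss that is added during training is not included.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return [z - log_norm for z in logits]

def kl_from_logs(log_p, log_q):
    """KL(p || q) computed from log-probabilities."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def emphasized_dkd(teacher_logits, student_logits, target, beta=2.0):
    """TSKD + beta * NSKD for one utterance (beta replaces the teacher's 1 - p_t)."""
    log_t = log_softmax(teacher_logits)
    log_s = log_softmax(student_logits)
    pt, ps = math.exp(log_t[target]), math.exp(log_s[target])

    # TSKD: binary KL between [target, non-target] probabilities.
    tskd = pt * math.log(pt / ps) + (1.0 - pt) * math.log((1.0 - pt) / (1.0 - ps))

    # NSKD: KL between posteriors renormalised over non-target speakers.
    nt_t = log_softmax([z for i, z in enumerate(teacher_logits) if i != target])
    nt_s = log_softmax([z for i, z in enumerate(student_logits) if i != target])
    nskd = kl_from_logs(nt_t, nt_s)

    return {
        "loss": tskd + beta * nskd,
        "tskd": tskd,
        "nskd": nskd,
        "implicit_weight": 1.0 - pt,  # weight NSKD receives in conventional KD
    }

# Toy logits over four speakers; speaker 0 is the target.
out = emphasized_dkd([9.0, 1.0, 0.5, 0.0], [4.0, 0.2, 1.5, 0.3], target=0, beta=2.0)
conventional = out["tskd"] + out["implicit_weight"] * out["nskd"]
print(f"implicit weight={out['implicit_weight']:.6f}")
print(f"conventional KD={conventional:.6f}  emphasized DKD={out['loss']:.6f}")
```

Setting `beta` to the implicit weight recovers conventional label-level KD, and `beta=0` keeps only the TSKD term, which mirrors the two reference points used in the ablation study.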
| System | Params (M) | FLOPs (G) | Distillation Method | Vox1-O EER (%) / minDCF | Vox1-E EER (%) / minDCF | Vox1-H EER (%) / minDCF |
|---|---|---|---|---|---|---|
| Teacher model WavLM-TDNN [14] | 316.62 | 26 | - | 0.383 / - | 0.480 / - | 0.986 / - |
| TDNN-based student model x-vector [15] | 4.61 | 0.53 | - | 1.835 / - | 1.822 / - | 3.110 / - |
| | | | Embedding-level | 1.760 / 0.189 | 1.742 / 0.185 | 2.879 / 0.255 |
| | | | Label-level | 1.585 / 0.171 | 1.589 / 0.171 | 2.704 / 0.244 |
| | | | Ours | 1.319 / 0.160 | 1.388 / 0.155 | 2.440 / 0.226 |
| CNN-based student model ResNet34 [16] | 6.64 | 4.55 | - | 0.862 / 0.089 | 1.035 / 0.112 | 1.827 / 0.176 |
| | | | Embedding-level | 0.829 / 0.088 | 0.943 / 0.107 | 1.694 / 0.164 |
| | | | Label-level | 0.771 / 0.086 | 0.939 / 0.103 | 1.728 / 0.166 |
| | | | Ours | 0.766 / 0.101 | 0.850 / 0.096 | 1.615 / 0.161 |
| D-TDNN-based student model CAM++ [17] | 7.18 | 1.72 | - | 0.718 / - | 0.879 / - | 1.735 / - |
| | | | Embedding-level | 0.713 / 0.118 | 0.901 / 0.108 | 1.768 / 0.182 |
| | | | Label-level | 0.633 / 0.101 | 0.790 / 0.093 | 1.572 / 0.159 |
| | | | Ours | 0.590 / 0.118 | 0.735 / 0.085 | 1.494 / 0.148 |
3 Experimental Setup
3.1 Dataset
3.2 Model
3.3 Training and Evaluation
During training, each audio sample was randomly cropped to a 2-second segment, and 80-dimensional Fbank features were then extracted using a frame length of 25 ms and a frame shift of 10 ms. For the classification loss function, we employed AAM-softmax [22] with a scale of 32 and a margin scheduler. In the proposed KD method, following the ablation study in Section 4.2, the value of $\beta$ in (8) is set to 2.0 in all remaining experiments. For evaluation, speaker embeddings were scored using cosine similarity and score normalization. Performance is reported on two metrics: Equal Error Rate (EER) and the minimum of the normalized detection cost function (minDCF) with and . All experiments are conducted using the Wespeaker toolkit [23].
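The evaluation step can be illustrated with a small, self-contained sketch of cosine scoring and EER computation. It is a simplification under stated assumptions and not the Wespeaker implementation: score normalization and minDCF are left out, tied scores are processed one at a time, and the function names, embeddings, and trial list are hypothetical toy values.

```python
import math

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings (lists of floats)."""
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    return dot / (norm_a * norm_b)

def compute_eer(scores, labels):
    """Equal error rate from trial scores; label 1 = target trial, 0 = non-target."""
    num_target = sum(labels)
    num_nontarget = len(labels) - num_target
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    # Start with a threshold above every score: nothing is accepted,
    # so the false-acceptance rate is 0 and the false-rejection rate is 1.
    false_accepts, misses = 0, num_target
    best_gap, eer = 1.0, 0.5
    for _, label in ranked:
        # Lower the threshold just enough to accept the next trial.
        if label == 1:
            misses -= 1
        else:
            false_accepts += 1
        far = false_accepts / num_nontarget
        frr = misses / num_target
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer

# Toy trial list: six scores with their target (1) / non-target (0) labels.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_eer = compute_eer(toy_scores, toy_labels)
print(f"EER = {100.0 * toy_eer:.2f}%")
print(f"cosine = {cosine_score([1.0, 2.0, 0.5], [0.9, 2.1, 0.4]):.4f}")
```

The EER is the operating point at which the false-acceptance and false-rejection rates are equal; on a finite trial list the sketch returns the midpoint of the two rates at the threshold where they are closest.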
4 Results and Analysis
4.1 Results of the proposed method
Table 1 compares the teacher model, student models trained solely with classification loss, and student models trained with different knowledge distillation methods. Although both embedding-level and conventional label-level KD methods generally outperform student networks trained solely with classification loss, the improvement remains limited. Under the limited number of parameters and floating-point operations (FLOPs), the smallest student model, x-vector, trained with DKD emphasizing non-target speakers exhibits the largest improvement: a 28.12% relative reduction in Vox1-O EER compared to its baseline trained with classification loss only. Moreover, our proposed method enables the state-of-the-art CAM++ model to further boost its performance to an EER of 0.590%, while the model’s size and FLOPs are respectively 97.73% and 93.39% smaller than those of the teacher model. Lastly, all three student networks trained using our proposed method achieve lower EER than the embedding-level and conventional label-level KD methods, especially on challenging sets like Vox1-E and Vox1-H. This indicates that DKD emphasizing non-target speaker probabilities effectively improves the performance of student models.
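The relative figures quoted in this section follow from the entries of Table 1 by simple arithmetic, which the short sketch below reproduces. The helper name `relative_reduction` is ours and is used only for illustration.

```python
def relative_reduction(baseline, value):
    """Relative reduction of `value` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline

# x-vector, Vox1-O EER: classification loss only vs. the proposed method.
eer_gain = relative_reduction(1.835, 1.319)
# CAM++ student vs. teacher: parameters (M) and FLOPs (G) as listed in Table 1.
param_cut = relative_reduction(316.62, 7.18)
flops_cut = relative_reduction(26, 1.72)

print(f"EER reduction:   {eer_gain:.2f}%")
print(f"Param reduction: {param_cut:.2f}%")
print(f"FLOPs reduction: {flops_cut:.2f}%")
```

The FLOPs figure computed this way can differ from the quoted value in the last digit, presumably because the teacher’s FLOPs are rounded in the table.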
4.2 Ablation Study: The impact of $\beta$
We conducted an ablation study on the hyperparameter $\beta$ in the DKD formula (8) to show how robust the proposed method is to its value. To save computational cost, we present results only for the x-vector student model, summarized in Table 2. When emphasizing the NSKD loss $\mathcal{L}_{NSKD}$ with non-zero values of $\beta$, all results exceed the performance of $\beta = 1 - p_t^{T}$ (the weight implied by (7)), which is equivalent to conventional label-level knowledge distillation. Removing $\mathcal{L}_{NSKD}$ by setting $\beta = 0$ yields a worse result than conventional label-level knowledge distillation. In alignment with the findings from Section 2.1, a larger value of $\beta$ generally leads to better performance, implying the increased significance of $\mathcal{L}_{NSKD}$. Notably, the best configuration, $\beta = 2.0$, achieves an average improvement of 13% in EER compared to conventional label-level knowledge distillation.
| $\beta$ | Vox1-O EER (%) / minDCF | Vox1-E EER (%) / minDCF | Vox1-H EER (%) / minDCF |
|---|---|---|---|
| $1 - p_t^{T}$ (conventional label-level KD) | 1.585 / 0.171 | 1.589 / 0.171 | 2.704 / 0.244 |
| 0 | 1.622 / 0.152 | 1.646 / 0.175 | 2.786 / 0.252 |
| | 1.463 / 0.166 | 1.452 / 0.155 | 2.520 / 0.225 |
| 2.0 | 1.319 / 0.160 | 1.388 / 0.155 | 2.440 / 0.226 |
| | 1.361 / 0.143 | 1.415 / 0.156 | 2.511 / 0.229 |
5 Conclusion
This paper has shown the benefit of leveraging non-target speakers for training automatic speaker verification models. Based on this finding, we modified conventional label-level KD to emphasize the classification probabilities of non-target speakers, which involves splitting off and amplifying the non-target speakers’ probabilities during the knowledge distillation process. Experimental results on the VoxCeleb test sets show that the proposed method achieves an average of 13.67% improvement in EER compared to other knowledge distillation methods across three student model architectures.
6 Acknowledgement
The computational work for this article was partially performed on resources of the National Supercomputing Centre, Singapore (https://www.nscc.sg).
References
- [1] Mirco Ravanelli, Jianyuan Zhong, Santiago Pascual, Pawel Swietojanski, Joao Monteiro, Jan Trmal, and Yoshua Bengio, “Multi-task self-supervised learning for robust speech recognition,” in IEEE ICASSP, 2020, pp. 6989–6993.
- [2] Wei-Tsung Kao, Yuan-Kuei Wu, Chia-Ping Chen, Zhi-Sheng Chen, Yu-Pao Tsai, and Hung-Yi Lee, “On the efficiency of integrating self-supervised learning and meta-learning for user-defined few-shot keyword spotting,” in SLT, 2023, pp. 414–421.
- [3] Tarun Gupta, Duc-Tuan Truong, Tran The Anh, and Chng Eng Siong, “Estimation of speaker age and height from speech signal using bi-encoder transformer mixture model,” in Proc. INTERSPEECH, 2022, pp. 1978–1982.
- [4] Ruijie Tao, Kong Aik Lee, Rohan Kumar Das, Ville Hautamäki, and Haizhou Li, “Self-supervised speaker recognition with loss-gated learning,” in IEEE ICASSP, 2022, pp. 6142–6146.
- [5] Tianchi Liu, Kong Aik Lee, Qiongqiong Wang, and Haizhou Li, “Disentangling voice and content with self-supervision for speaker recognition,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- [6] Shuai Wang, Yexin Yang, Tianzhe Wang, Yanmin Qian, and Kai Yu, “Knowledge distillation for small foot-print deep speaker embedding,” in IEEE ICASSP, 2019, pp. 6021–6025.
- [7] Zhiyuan Peng, Xuanji He, Ke Ding, Tan Lee, and Guanglu Wan, “Label-free knowledge distillation with contrastive loss for light-weight speaker recognition,” in ISCSLP, 2022, pp. 324–328.
- [8] Xuechen Liu, Md Sahidullah, and Tomi Kinnunen, “Distilling multi-level x-vector knowledge for small-footprint speaker verification,” arXiv preprint arXiv:2303.01125, 2023.
- [9] Jungwoo Heo, Chan-yeong Lim, Ju-ho Kim, Hyun-seo Shin, and Ha-Jin Yu, “One-Step Knowledge Distillation and Fine-Tuning in Using Large Pre-Trained Self-Supervised Learning Models for Speaker Verification,” in Proc. INTERSPEECH, 2023, pp. 5271–5275.
- [10] Leying Zhang, Zhengyang Chen, and Yanmin Qian, “Knowledge distillation from multi-modality to single-modality for person verification,” in Proc. INTERSPEECH, 2021, pp. 1897–1901.
- [11] Nik Vaessen and David van Leeuwen, “Training speaker recognition systems with limited data,” in Proc. INTERSPEECH, 2022, pp. 4760–4764.
- [12] Yaobin Zhang and Weihong Deng, “Class-balanced training for deep face recognition,” in IEEE/CVF CVPRW, 2020, pp. 3594–3603.
- [13] Borui Zhao, Quan Cui, Renjie Song, Yiyu Qiu, and Jiajun Liang, “Decoupled knowledge distillation,” in IEEE/CVF CVPR, 2022, pp. 11943–11952.
- [14] Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
- [15] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” in IEEE ICASSP, 2018, pp. 5329–5333.
- [16] Hossein Zeinali, Shuai Wang, Anna Silnova, Pavel Matějka, and Oldřich Plchot, “BUT system description to VoxCeleb speaker recognition challenge 2019,” arXiv preprint arXiv:1910.12592, 2019.
- [17] Hui Wang, Siqi Zheng, Yafeng Chen, Luyao Cheng, and Qian Chen, “CAM++: A Fast and Efficient Network for Speaker Verification Using Context-Aware Masking,” in Proc. INTERSPEECH, 2023, pp. 5301–5305.
- [18] Arsha Nagrani, Joon Son Chung, Weidi Xie, and Andrew Zisserman, “VoxCeleb: Large-scale speaker verification in the wild,” Computer Speech & Language, vol. 60, pp. 101027, 2020.
- [19] David Snyder, Guoguo Chen, and Daniel Povey, “MUSAN: A Music, Speech, and Noise Corpus,” 2015, arXiv:1510.08484v1.
- [20] Tom Ko, Vijayaditya Peddinti, Daniel Povey, Michael L. Seltzer, and Sanjeev Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in IEEE ICASSP, 2017, pp. 5220–5224.
- [21] Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck, “ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification,” in Proc. INTERSPEECH, 2020, pp. 3830–3834.
- [22] Jiankang Deng et al., “ArcFace: Additive angular margin loss for deep face recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 10, pp. 5962–5979, Oct. 2022.
- [23] Hongji Wang, Chengdong Liang, Shuai Wang, Zhengyang Chen, Binbin Zhang, Xu Xiang, Yanlei Deng, and Yanmin Qian, “Wespeaker: A research and production oriented speaker embedding learning toolkit,” in IEEE ICASSP, 2023, pp. 1–5.