License: arXiv.org perpetual non-exclusive license
arXiv:2309.14838v2 [cs.SD] 15 Jan 2024

Emphasized Non-Target Speaker Knowledge in Knowledge Distillation
for Automatic Speaker Verification

Abstract

Knowledge distillation (KD) is used to enhance automatic speaker verification performance by ensuring consistency between large teacher networks and lightweight student networks at the embedding level or label level. However, the conventional label-level KD overlooks the significant knowledge from non-target speakers, particularly their classification probabilities, which can be crucial for automatic speaker verification. In this paper, we first demonstrate that leveraging a larger number of training non-target speakers improves the performance of automatic speaker verification models. Inspired by this finding about the importance of non-target speakers’ knowledge, we modified the conventional label-level KD by disentangling and emphasizing the classification probabilities of non-target speakers during knowledge distillation. The proposed method is applied to three different student model architectures and achieves an average of 13.67% improvement in EER on the VoxCeleb dataset compared to embedding-level and conventional label-level KD methods. Code and models are available at github.com/ductuantruong/enskd.

Index Terms—  knowledge distillation, label-level knowledge distillation, automatic speaker verification

1 Introduction

Automatic speaker verification (ASV) is the process of authenticating an individual’s claimed identity based on voice characteristics. By leveraging large-scale neural networks trained on abundant unlabelled speech data, self-supervised learning (SSL) models have revolutionized various speech processing tasks [1, 2, 3], including ASV [4, 5]. However, these models are computationally expensive. To better utilize SSL models, knowledge distillation (KD) can be employed to transfer their robust speech representations to smaller student models. In ASV, KD encompasses two common approaches. The first is the embedding-level method [6, 7, 8, 9], which makes student models mimic the teacher’s intermediate feature embeddings by reducing the distance between the representation spaces. The second is the label-level method [6, 10], which minimizes the Kullback–Leibler divergence between the output probabilities of the teacher and student networks.

During the training of an ASV model, the objective is to classify the input speech as the target speaker (the ground-truth speaker) and avoid assigning it to non-target speakers (incorrect speakers). While the importance of the target speaker is evident, non-target speakers can also enhance the model’s discriminability, since numerous non-target speakers may share similar voice characteristics with the target speaker. In past studies, [11] compared the performance of ASV models trained on two training sets with different numbers of speakers but the same number of utterances, and found that a larger number of speakers improved the performance. Similarly, in face recognition, [12] observed that increasing the number of non-target training classes improved model performance within a fixed-size training set. Building on these observations, we hypothesize that integrating knowledge from non-target speakers can enhance ASV model performance. However, because the conventional label-level KD considers correlations among the teacher’s output probabilities of all speakers, the importance of non-target speakers’ probabilities can be overshadowed by the target speaker, to which the teacher model assigns high classification confidence. Based on this hypothesis, the conventional label-level KD approach for ASV can be improved by emphasizing the knowledge of non-target speakers.

To validate the assumptions above, this paper first presents an experiment illustrating the importance of non-target speakers in ASV. When the number of training utterances remains the same, we observe that an increasing number of non-target training speakers leads to better results. Based on this observation, we investigate the significance of non-target speakers in the conventional label-level KD for ASV models. Following Decoupled Knowledge Distillation (DKD) [13], we separate the output classification probabilities of the teacher and student models into two distinct sets of probabilities, one for the target speaker and one for the non-target speakers. Subsequently, the probabilities of non-target speakers are emphasized during KD using a specific weight. We utilize the large-scale SSL model WavLM-TDNN [14] as our teacher model and employ three different network architectures, x-vector [15], ResNet34 [16], and CAM++ [17], as student models. Our experiments show that DKD with an emphasis on the non-target speakers’ output probabilities outperforms both embedding-level and conventional label-level KD methods across student models.

2 Methodology

2.1 The impact of non-target speakers for ASV

Fig. 1: Vox1-O results (EER %) of x-vector model trained on a fixed number of utterances but varying numbers of speakers.

To validate the hypothesis that a larger set of non-target speakers benefits ASV models, we conducted a toy experiment. We trained the x-vector model using a fixed budget of 100,000 training utterances from the VoxCeleb2 dev set [18]. These training utterances are evenly distributed among the training speakers, hence increasing the number of speakers leads to fewer training utterances per speaker. As depicted in Figure 1, the performance of the x-vector model consistently improves with an increasing number of speakers in the training set. This indicates that involving more non-target speakers enhances the model’s ability to distinguish the target speaker from others. Inspired by this finding, we further extract and emphasize non-target speaker knowledge during the knowledge distillation process.
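The paper does not describe how the fixed-budget subsets were drawn, so the following is only a minimal sketch of one way to build them, assuming uniform random selection. The function name `sample_fixed_budget`, the toy corpus, and the seed are all hypothetical.

```python
import random


def sample_fixed_budget(utts_by_speaker, num_speakers, total_utts, seed=0):
    """Pick `num_speakers` speakers and split `total_utts` evenly among them."""
    rng = random.Random(seed)
    per_speaker = total_utts // num_speakers
    # Only speakers with enough utterances can fill their share of the budget.
    eligible = sorted(s for s, u in utts_by_speaker.items() if len(u) >= per_speaker)
    if len(eligible) < num_speakers:
        raise ValueError("not enough speakers with the required number of utterances")
    chosen = rng.sample(eligible, num_speakers)
    return {s: rng.sample(utts_by_speaker[s], per_speaker) for s in chosen}


# Toy corpus (hypothetical): 8 speakers with 12 utterances each.
corpus = {f"spk{i}": [f"spk{i}_utt{j}" for j in range(12)] for i in range(8)}

# Same utterance budget (24), increasing number of speakers.
for n in (2, 4, 8):
    subset = sample_fixed_budget(corpus, num_speakers=n, total_utts=24)
    total = sum(len(u) for u in subset.values())
    print(f"{n} speakers, {total // n} utterances each, {total} in total")
```

With the budget held constant, doubling the number of speakers halves the utterances per speaker, which is the trade-off the experiment in Fig. 1 varies.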

Fig. 2: Our Decoupled Knowledge Distillation (DKD) with an emphasis on non-target speaker knowledge in comparison with the embedding-level knowledge distillation (using cosine distance loss $\mathcal{L}_{\text{COS}}$) and the conventional label-level knowledge distillation (using Kullback–Leibler divergence loss $\mathcal{L}_{\text{KD}}$). $\mathcal{T}$, $\mathcal{S}$, $K$, and $\tau$ denote the teacher model, the student model, the number of training speakers, and the target speaker, respectively. $p_i$ is defined in Eq. (1); $p_{\bar{\tau}}$ and $\hat{p}_i$ are defined in Eq. (4). $\mathcal{L}_{\text{TSKD}}$ and $\mathcal{L}_{\text{NSKD}}$ are defined in Eq. (6), and $\gamma$ in Eq. (8).

2.2 Rethinking conventional label-level KD

Following the reformulation of the conventional label-level KD for the computer vision task in [13], we interpret the conventional label-level KD loss for automatic speaker verification. In the training phase of the ASV model, the model’s output is the vector of classification probabilities $\mathbf{p}$ over the set $\mathcal{K}$ of $K$ training speakers, in which the probability $p_i$ of the $i$-th speaker is computed using the softmax function, which transforms the logits $z_i$ into a probability distribution as follows:

$$p_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \tag{1}$$

In the conventional label-level KD, the student model tries to mimic the teacher model by minimizing the Kullback–Leibler divergence $D_{\text{KL}}$ between the student ($\mathcal{S}$) and teacher ($\mathcal{T}$) output probability distributions. The $D_{\text{KL}}$ loss is defined as:

$$\mathcal{L}_{\text{KD}} = D_{\text{KL}}(\mathbf{p}^{\mathcal{T}} \,\|\, \mathbf{p}^{\mathcal{S}}) = \sum_{i \in \mathcal{K}} p_i^{\mathcal{T}} \log\left(\frac{p_i^{\mathcal{T}}}{p_i^{\mathcal{S}}}\right) \tag{2}$$

where $\mathbf{p}^{\mathcal{T}}$ and $\mathbf{p}^{\mathcal{S}}$ denote the output probabilities of the teacher and student networks, respectively. We further split the index set $\mathcal{K} = \{1, \dots, K\}$ in $\mathcal{L}_{\text{KD}}$ into the target speaker $\tau$ and the set of non-target speakers $\mathcal{K} \backslash \{\tau\}$ as:

$$\mathcal{L}_{\text{KD}} = p_{\tau}^{\mathcal{T}} \log\left(\frac{p_{\tau}^{\mathcal{T}}}{p_{\tau}^{\mathcal{S}}}\right) + \sum_{i \in \mathcal{K} \backslash \{\tau\}} p_i^{\mathcal{T}} \log\left(\frac{p_i^{\mathcal{T}}}{p_i^{\mathcal{S}}}\right) \tag{3}$$

We define the probability that the input is classified as any speaker in $\mathcal{K} \backslash \{\tau\}$ as $p_{\bar{\tau}}$, and the probability of predicting a specific non-target speaker $i \neq \tau$ among all non-target speakers as $\hat{p}_i$:

$$p_{\bar{\tau}} = \frac{\sum_{i \in \mathcal{K} \backslash \{\tau\}} e^{z_i}}{\sum_{j \in \mathcal{K}} e^{z_j}}, \quad \hat{p}_i = \frac{e^{z_i}}{\sum_{j \in \mathcal{K} \backslash \{\tau\}} e^{z_j}} \tag{4}$$

From (1) and (4), we have $p_i = p_{\bar{\tau}} \hat{p}_i$ for every $i \neq \tau$. Substituting this into (3) gives:

$$
\begin{aligned}
\mathcal{L}_{\text{KD}} &= p_{\tau}^{\mathcal{T}} \log\left(\frac{p_{\tau}^{\mathcal{T}}}{p_{\tau}^{\mathcal{S}}}\right) + p_{\bar{\tau}}^{\mathcal{T}} \sum_{i \in \mathcal{K} \backslash \{\tau\}} \hat{p}_i^{\mathcal{T}} \log\left(\frac{p_{\bar{\tau}}^{\mathcal{T}} \hat{p}_i^{\mathcal{T}}}{p_{\bar{\tau}}^{\mathcal{S}} \hat{p}_i^{\mathcal{S}}}\right) \\
&= p_{\tau}^{\mathcal{T}} \log\left(\frac{p_{\tau}^{\mathcal{T}}}{p_{\tau}^{\mathcal{S}}}\right) + p_{\bar{\tau}}^{\mathcal{T}} \sum_{i \in \mathcal{K} \backslash \{\tau\}} \hat{p}_i^{\mathcal{T}} \log\left(\frac{p_{\bar{\tau}}^{\mathcal{T}}}{p_{\bar{\tau}}^{\mathcal{S}}}\right) \\
&\quad + p_{\bar{\tau}}^{\mathcal{T}} \sum_{i \in \mathcal{K} \backslash \{\tau\}} \hat{p}_i^{\mathcal{T}} \log\left(\frac{\hat{p}_i^{\mathcal{T}}}{\hat{p}_i^{\mathcal{S}}}\right)
\end{aligned}
\tag{5}
$$

Since $p_{\bar{\tau}}^{\mathcal{T}}$ and $p_{\bar{\tau}}^{\mathcal{S}}$ are independent of the class index $i$ and $\sum_{i \in \mathcal{K} \backslash \{\tau\}} \hat{p}_i^{\mathcal{T}} = 1$, we can simplify (5) to:

$$
\mathcal{L}_{\text{KD}} = \underbrace{p_{\tau}^{\mathcal{T}} \log\left(\frac{p_{\tau}^{\mathcal{T}}}{p_{\tau}^{\mathcal{S}}}\right) + p_{\bar{\tau}}^{\mathcal{T}} \log\left(\frac{p_{\bar{\tau}}^{\mathcal{T}}}{p_{\bar{\tau}}^{\mathcal{S}}}\right)}_{D_{\text{KL}}(\mathbf{b}^{\mathcal{T}} \,\|\, \mathbf{b}^{\mathcal{S}})} + p_{\bar{\tau}}^{\mathcal{T}} \underbrace{\sum_{i \in \mathcal{K} \backslash \{\tau\}} \hat{p}_i^{\mathcal{T}} \log\left(\frac{\hat{p}_i^{\mathcal{T}}}{\hat{p}_i^{\mathcal{S}}}\right)}_{D_{\text{KL}}(\hat{\mathbf{p}}^{\mathcal{T}} \,\|\, \hat{\mathbf{p}}^{\mathcal{S}})}
\tag{6}
$$

From (6), the conventional label-level KD can be reformulated as a weighted sum of two terms: 1) the Target Speaker Knowledge Distillation (TSKD) loss $\mathcal{L}_{\text{TSKD}}$: the $D_{\text{KL}}$ over the binary classification probability $\mathbf{b} \in \mathbb{R}^{2}$ of the target speaker and all non-target speakers, and 2) the Non-Target Speaker Knowledge Distillation (NSKD) loss $\mathcal{L}_{\text{NSKD}}$: the $D_{\text{KL}}$ over the multi-class classification probability $\hat{\mathbf{p}} \in \mathbb{R}^{K-1}$ among the $K-1$ non-target speakers, as shown in (7) and Fig. 2.

$$
D_{\text{KL}}(\mathbf{p}^{\mathcal{T}} \,\|\, \mathbf{p}^{\mathcal{S}}) = \underbrace{D_{\text{KL}}(\mathbf{b}^{\mathcal{T}} \,\|\, \mathbf{b}^{\mathcal{S}})}_{\mathcal{L}_{\text{TSKD}}} + (1 - p_{\tau}^{\mathcal{T}}) \underbrace{D_{\text{KL}}(\hat{\mathbf{p}}^{\mathcal{T}} \,\|\, \hat{\mathbf{p}}^{\mathcal{S}})}_{\mathcal{L}_{\text{NSKD}}}
\tag{7}
$$

From the above equation, when the teacher model predicts the target speaker accurately, a large value of $p_{\tau}^{\mathcal{T}}$ results in a small $(1 - p_{\tau}^{\mathcal{T}})$, which suppresses $\mathcal{L}_{\text{NSKD}}$. This could potentially hinder the distillation of knowledge from non-target speakers in the label-level KD method.
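The decomposition in (7) can be checked numerically. The sketch below uses plain Python and made-up logits for a four-speaker toy problem; the helper names (`softmax`, `kl`, `decompose`) and all numbers are illustrative assumptions, not part of the released code.

```python
import math


def softmax(logits):
    """Softmax of Eq. (1), shifted by the max logit for numerical stability."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q) for strictly positive p and q."""
    return sum(a * math.log(a / b) for a, b in zip(p, q))


def decompose(logits, target):
    """Return (b, p_hat): the binary target/non-target distribution and the
    distribution renormalised over the K-1 non-target speakers, as in Eq. (4)."""
    p = softmax(logits)
    b = [p[target], 1.0 - p[target]]
    p_hat = softmax([z for i, z in enumerate(logits) if i != target])
    return b, p_hat


# Hypothetical logits; speaker 0 is the target, and the teacher is confident.
teacher = [6.0, 2.0, 1.0, -1.0]
student = [3.0, 2.5, 0.0, -0.5]
tau = 0

b_t, phat_t = decompose(teacher, tau)
b_s, phat_s = decompose(student, tau)
tskd = kl(b_t, b_s)
nskd = kl(phat_t, phat_s)

full_kd = kl(softmax(teacher), softmax(student))   # Eq. (2)
recomposed = tskd + (1.0 - b_t[0]) * nskd          # Eq. (7)

print("full KD loss      :", full_kd)
print("TSKD + (1-p) NSKD :", recomposed)
print("NSKD weight 1-p   :", 1.0 - b_t[0])
```

The two printed losses agree up to floating-point error, and the NSKD weight is small because the toy teacher assigns most of its probability mass to the target speaker.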

2.3 Decoupled Knowledge Distillation with an emphasis on non-target speaker knowledge

In Section 2.1, it was demonstrated that leveraging more non-target speaker knowledge can enhance the performance of ASV models. In other words, $\mathcal{L}_{\text{NSKD}}$ may play a crucial role in the knowledge transfer from the teacher to student models. Decoupled Knowledge Distillation (DKD) [13] proposed a modification to remove the dependency factor $(1 - p_{\tau}^{\mathcal{T}})$ in (7) by introducing hyperparameters to balance $\mathcal{L}_{\text{TSKD}}$ and $\mathcal{L}_{\text{NSKD}}$. However, to place a greater emphasis on $\mathcal{L}_{\text{NSKD}}$, we adjusted the original DKD method by simply replacing $(1 - p_{\tau}^{\mathcal{T}})$ with the hyperparameter $\gamma$ in the following manner:

$$\mathcal{L}_{\text{DKD}} = \mathcal{L}_{\text{TSKD}} + \gamma \mathcal{L}_{\text{NSKD}} \tag{8}$$

Finally, the DKD loss $\mathcal{L}_{\text{DKD}}$ is combined with the classification loss to optimize the student model. Fig. 2 illustrates the comparison between DKD with an emphasis on non-target speaker knowledge, embedding-level, and conventional label-level knowledge distillation.
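As a minimal sketch of Eq. (8), the loss for a single utterance can be written in plain Python as follows. The function names and the toy logits are assumptions made for illustration. The sketch omits batching, any softmax temperature, numerical safeguards, and the combination with the classification loss, whose weighting is not specified here.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q))


def dkd_terms(teacher_logits, student_logits, target):
    """Return (L_TSKD, L_NSKD, teacher probability of the target speaker)."""
    p_t, p_s = softmax(teacher_logits), softmax(student_logits)
    # Binary distributions: target speaker vs. all non-target speakers.
    b_t = [p_t[target], 1.0 - p_t[target]]
    b_s = [p_s[target], 1.0 - p_s[target]]
    # Distributions renormalised over the non-target speakers only.
    others = [i for i in range(len(teacher_logits)) if i != target]
    phat_t = softmax([teacher_logits[i] for i in others])
    phat_s = softmax([student_logits[i] for i in others])
    return kl(b_t, b_s), kl(phat_t, phat_s), p_t[target]


def dkd_loss(teacher_logits, student_logits, target, gamma=2.0):
    """Eq. (8): L_TSKD + gamma * L_NSKD, with a fixed weight gamma."""
    tskd, nskd, _ = dkd_terms(teacher_logits, student_logits, target)
    return tskd + gamma * nskd


# Hypothetical logits for a four-speaker toy problem; speaker 0 is the target.
teacher = [6.0, 2.0, 1.0, -1.0]
student = [3.0, 2.5, 0.0, -0.5]

_, _, p_tau = dkd_terms(teacher, student, 0)
print("conventional KD (gamma = 1 - p_tau):", dkd_loss(teacher, student, 0, gamma=1.0 - p_tau))
print("emphasized NSKD (gamma = 2.0)      :", dkd_loss(teacher, student, 0, gamma=2.0))
```

Setting `gamma` to the teacher-dependent factor $1 - p_{\tau}^{\mathcal{T}}$ recovers the conventional label-level KD loss of (7), while a constant such as 2.0 keeps the non-target term from shrinking when the teacher is confident.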

Table 1: Results on the VoxCeleb1 test sets, reported as EER (%) / minDCF. COS and KLD denote embedding-level and conventional label-level KD, respectively.

| Model type | System | Params (M) | FLOPs (G) | Distillation method | Vox1-O | Vox1-E | Vox1-H |
|---|---|---|---|---|---|---|---|
| Teacher model | WavLM-TDNN [14] | 316.62 | ~26 | - | 0.383 / - | 0.480 / - | 0.986 / - |
| TDNN-based student model | x-vector [15] | 4.61 | 0.53 | - | 1.835 / - | 1.822 / - | 3.110 / - |
| | | | | COS | 1.760 / 0.189 | 1.742 / 0.185 | 2.879 / 0.255 |
| | | | | KLD | 1.585 / 0.171 | 1.589 / 0.171 | 2.704 / 0.244 |
| | | | | Ours | 1.319 / 0.160 | 1.388 / 0.155 | 2.440 / 0.226 |
| CNN-based student model | ResNet34 [16] | 6.64 | 4.55 | - | 0.862 / 0.089 | 1.035 / 0.112 | 1.827 / 0.176 |
| | | | | COS | 0.829 / 0.088 | 0.943 / 0.107 | 1.694 / 0.164 |
| | | | | KLD | 0.771 / 0.086 | 0.939 / 0.103 | 1.728 / 0.166 |
| | | | | Ours | 0.766 / 0.101 | 0.850 / 0.096 | 1.615 / 0.161 |
| D-TDNN-based student model | CAM++ [17] | 7.18 | 1.72 | - | 0.718 / - | 0.879 / - | 1.735 / - |
| | | | | COS | 0.713 / 0.118 | 0.901 / 0.108 | 1.768 / 0.182 |
| | | | | KLD | 0.633 / 0.101 | 0.790 / 0.093 | 1.572 / 0.159 |
| | | | | Ours | 0.590 / 0.118 | 0.735 / 0.085 | 1.494 / 0.148 |

3 Experimental Setup

3.1 Dataset

We utilized the VoxCeleb2 dev dataset [18] for training and evaluated the performance on three test trials, Vox1-O, Vox1-E, and Vox1-H. During training, we applied data augmentation using the MUSAN noise corpus [19] and RIRs reverberation [20], with a probability of 0.6.

3.2 Model

The teacher model is the SSL-based ASV system [14] combining WavLM Large and ECAPA-TDNN [21]. For the student models, we used several network architectures: the TDNN-based x-vector [15], the CNN-based ResNet34 [16], and the D-TDNN-based CAM++ [17].

3.3 Training and Evaluation

During training, each audio sample was randomly cropped to a 2-second segment, then 80-dimensional Fbank features were extracted using a frame length of 25 ms and a frame shift of 10 ms. For the classification loss function, we employed the AAM-softmax [22] with a scale of 32 and a margin scheduler. In the proposed KD method, based on the ablation study in Section 4.2, the value of $\gamma$ in (8) is set to 2.0 in all the remaining experiments. For evaluation, speaker embeddings were scored using cosine similarity and score normalization. Performance is reported on two metrics: Equal Error Rate (EER) and the minimum of the normalized detection cost function (minDCF) with $P_{\text{target}} = 0.01$ and $C_{\text{fa}} = C_{\text{miss}} = 1$. All experiments are conducted using the Wespeaker toolkit [23].
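For readers unfamiliar with the two metrics, the following is a simplified sketch of how EER and minDCF can be computed from a list of trial scores, using the cost parameters stated above. It is not the Wespeaker implementation: the function names and the toy scores are hypothetical, tied scores are not handled specially, and the EER is taken at the threshold where the two error rates are closest rather than by interpolation.

```python
def sweep_error_rates(scores, labels):
    """Return (miss_rates, fa_rates) for every threshold position.

    labels: 1 for target trials, 0 for non-target trials.
    Position k corresponds to rejecting the k lowest-scoring trials.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    misses, false_alarms = 0, n_nontarget
    miss_rates, fa_rates = [0.0], [1.0]   # threshold below every score
    for i in order:
        if labels[i] == 1:
            misses += 1          # a target trial now falls below the threshold
        else:
            false_alarms -= 1    # a non-target trial is now correctly rejected
        miss_rates.append(misses / n_target)
        fa_rates.append(false_alarms / n_nontarget)
    return miss_rates, fa_rates


def equal_error_rate(miss_rates, fa_rates):
    """EER approximated at the threshold where miss and false-alarm rates are closest."""
    k = min(range(len(miss_rates)), key=lambda j: abs(miss_rates[j] - fa_rates[j]))
    return 0.5 * (miss_rates[k] + fa_rates[k])


def min_dcf(miss_rates, fa_rates, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum detection cost, normalized by the cost of the best trivial system."""
    costs = [c_miss * p_target * m + c_fa * (1.0 - p_target) * f
             for m, f in zip(miss_rates, fa_rates)]
    return min(costs) / min(c_miss * p_target, c_fa * (1.0 - p_target))


# Toy trial list (hypothetical cosine scores): four target and four non-target trials.
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

miss, fa = sweep_error_rates(scores, labels)
print("EER   :", equal_error_rate(miss, fa))
print("minDCF:", min_dcf(miss, fa))
```

With $P_{\text{target}} = 0.01$, false alarms are weighted 99 times more heavily than misses, so minDCF rewards operating points with very few false alarms, whereas EER weights both error types equally.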

4 Results and Analysis

4.1 Results of the proposed method

Table 1 compares the teacher model, the student models trained solely with the classification loss, and the student models trained with different knowledge distillation methods. Although both embedding-level and conventional label-level KD methods generally outperform student networks trained solely with the classification loss, the improvement remains limited. Despite its limited number of parameters and floating-point operations (FLOPs), the smallest student model, x-vector, trained using DKD with an emphasis on non-target speakers exhibits the largest relative improvement, 28.12% in Vox1-O EER, over its baseline trained with the classification loss only. Moreover, our proposed method enables the state-of-the-art CAM++ model to further improve its performance to an EER of 0.590% on Vox1-O, while its parameter count and FLOPs are respectively 97.73% and 93.39% smaller than those of the teacher model. Lastly, all three student networks trained using our proposed method achieve a lower EER than those trained with the embedding-level and conventional label-level KD methods, especially on challenging sets like Vox1-E and Vox1-H. This indicates that DKD emphasizing non-target speaker probabilities effectively improves the performance of student models.
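The relative figures quoted above follow directly from Table 1. The short sketch below recomputes them; the helper name `relative_reduction` is an assumption, and the FLOPs figure is only approximate because Table 1 lists the teacher cost as roughly 26 G.

```python
def relative_reduction(baseline, improved):
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


# x-vector, Vox1-O EER: classification loss only vs. the proposed KD method.
xvector_eer_gain = relative_reduction(1.835, 1.319)

# CAM++ vs. the WavLM-TDNN teacher: parameters (M) and FLOPs (G).
cam_param_cut = relative_reduction(316.62, 7.18)
cam_flops_cut = relative_reduction(26.0, 1.72)   # teacher FLOPs are approximate

print(f"x-vector Vox1-O EER improvement: {xvector_eer_gain:.2f}%")
print(f"CAM++ parameter reduction      : {cam_param_cut:.2f}%")
print(f"CAM++ FLOPs reduction (approx.): {cam_flops_cut:.1f}%")
```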

4.2 Ablation Study: The impact of $\mathcal{L}_{\text{NSKD}}$

We conducted an ablation study on the hyperparameter $\gamma$ in the DKD formula to show how the performance of the proposed method varies with its value. To save computational cost, we present only the results of the student model x-vector, which are summarized in Table 2. When $\mathcal{L}_{\text{NSKD}}$ is emphasized with constant non-zero values of $\gamma$ (1.0, 2.0, and 4.0), all results exceed the performance of $\gamma = 1 - p_{\tau}^{\mathcal{T}}$, which is equivalent to the conventional label-level knowledge distillation method. Removing the NSKD term by setting $\gamma = 0$ yields a worse EER than conventional label-level knowledge distillation on all three test sets. In alignment with the findings from Section 2.1, increasing $\gamma$ from 0 to 2 steadily lowers the EER, implying the increased significance of $\mathcal{L}_{\text{NSKD}}$, while $\gamma = 4$ brings no further EER gain. Notably, the best configuration, $\gamma = 2$, achieved an average EER improvement of 13% over conventional label-level knowledge distillation.
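The role of $\gamma$ can be illustrated with a small sketch. Eq. (8) is not reproduced in this section, so the code assumes the decoupled form of [13]: a binary target-versus-rest KL term plus $\gamma$ times a KL term over the renormalized non-target probabilities. The function names, the omission of a softmax temperature, and the toy logits are assumptions made for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def decoupled_kd_loss(teacher_logits, student_logits, target, gamma):
    """Assumed form: target-speaker term + gamma * non-target-speaker term."""
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)

    # Total probability mass assigned to non-target speakers.
    rest_teacher = sum(p for i, p in enumerate(p_teacher) if i != target)
    rest_student = sum(p for i, p in enumerate(p_student) if i != target)

    # Binary distribution: target speaker vs. everyone else.
    target_term = kl_divergence(
        [p_teacher[target], rest_teacher], [p_student[target], rest_student]
    )

    # Non-target probabilities renormalized to sum to one.
    nt_teacher = [p / rest_teacher for i, p in enumerate(p_teacher) if i != target]
    nt_student = [p / rest_student for i, p in enumerate(p_student) if i != target]
    non_target_term = kl_divergence(nt_teacher, nt_student)

    return target_term + gamma * non_target_term


# Toy logits over four training speakers (hypothetical values).
teacher = [2.0, 0.5, -1.0, 0.3]
student = [1.0, 0.2, 0.1, -0.5]
target_index = 0

# gamma = 1 - p_target^teacher recovers the conventional KL-based KD loss.
p_target_teacher = softmax(teacher)[target_index]
conventional = decoupled_kd_loss(teacher, student, target_index, 1.0 - p_target_teacher)
full_kl = kl_divergence(softmax(teacher), softmax(student))
emphasized = decoupled_kd_loss(teacher, student, target_index, 2.0)
print(conventional, full_kl, emphasized)
```

Under this assumed form, a confident teacher (target probability close to 1) makes $1 - p_{\tau}^{\mathcal{T}}$ small, so conventional KD nearly suppresses the non-target term; a constant $\gamma$ removes that coupling.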

Table 2: Results of x-vector using different $\gamma$ values in Eq. (8). Each cell reports EER (%) / minDCF.

| $\gamma$ | Vox1-O | Vox1-E | Vox1-H |
|---|---|---|---|
| $1 - p_{\tau}^{\mathcal{T}}$ | 1.585 / 0.171 | 1.589 / 0.171 | 2.704 / 0.244 |
| 0.0 | 1.622 / 0.152 | 1.646 / 0.175 | 2.786 / 0.252 |
| 1.0 | 1.463 / 0.166 | 1.452 / 0.155 | 2.520 / 0.225 |
| 2.0 | 1.319 / 0.160 | 1.388 / 0.155 | 2.440 / 0.226 |
| 4.0 | 1.361 / 0.143 | 1.415 / 0.156 | 2.511 / 0.229 |
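As a quick check of the reported average gain, the snippet below recomputes the EER improvement of $\gamma = 2$ over the conventional setting from the Table 2 values. It assumes that the 13% figure is the mean of the per-set relative EER reductions, which the text does not state explicitly; the variable names are illustrative.

```python
# EER (%) values copied from Table 2.
conventional_eer = {"Vox1-O": 1.585, "Vox1-E": 1.589, "Vox1-H": 2.704}
gamma_2_eer = {"Vox1-O": 1.319, "Vox1-E": 1.388, "Vox1-H": 2.440}


def relative_improvement(baseline, proposed):
    """Relative EER reduction in percent."""
    return 100.0 * (baseline - proposed) / baseline


per_set = {
    name: relative_improvement(conventional_eer[name], gamma_2_eer[name])
    for name in conventional_eer
}
average = sum(per_set.values()) / len(per_set)

for name, gain in per_set.items():
    print(f"{name}: {gain:.2f}%")
print(f"average: {average:.2f}%")
```

Under this assumption the mean comes out at roughly 13%, with the largest relative gain on Vox1-O and the smallest on Vox1-H.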

5 Conclusion

This paper has shown the benefit of leveraging non-target speakers for training automatic speaker verification models. Based on this finding, we modified the conventional label-level KD to emphasize the classification probabilities of non-target speakers, which involves splitting and amplifying the non-target speakers' probabilities during the knowledge distillation process. Experimental results on the VoxCeleb test sets show that the proposed method achieves an average EER improvement of 13.67% over other knowledge distillation methods across three student model architectures.

6 Acknowledgement

The computational work for this article was partially performed on resources of the National Supercomputing Centre, Singapore (https://www.nscc.sg).

References

  • [1] Mirco Ravanelli, Jianyuan Zhong, Santiago Pascual, Pawel Swietojanski, Joao Monteiro, Jan Trmal, and Yoshua Bengio, “Multi-task self-supervised learning for robust speech recognition,” in IEEE ICASSP, 2020, pp. 6989–6993.
  • [2] Wei-Tsung Kao, Yuan-Kuei Wu, Chia-** Chen, Zhi-Sheng Chen, Yu-Pao Tsai, and Hung-Yi Lee, “On the efficiency of integrating self-supervised learning and meta-learning for user-defined few-shot keyword spotting,” in SLT, 2023, pp. 414–421.
  • [3] Tarun Gupta, Duc-Tuan Truong, Tran The Anh, and Chng Eng Siong, “Estimation of speaker age and height from speech signal using bi-encoder transformer mixture model,” in Proc. INTERSPEECH, 2022, pp. 1978–1982.
  • [4] Ruijie Tao, Kong Aik Lee, Rohan Kumar Das, Ville Hautamäki, and Haizhou Li, “Self-supervised speaker recognition with loss-gated learning,” in IEEE ICASSP, 2022, pp. 6142–6146.
  • [5] Tianchi Liu, Kong Aik Lee, Qiongqiong Wang, and Haizhou Li, “Disentangling voice and content with self-supervision for speaker recognition,” in Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • [6] Shuai Wang, Yexin Yang, Tianzhe Wang, Yanmin Qian, and Kai Yu, “Knowledge distillation for small foot-print deep speaker embedding,” in IEEE ICASSP, 2019, pp. 6021–6025.
  • [7] Zhiyuan Peng, Xuanji He, Ke Ding, Tan Lee, and Guanglu Wan, “Label-free knowledge distillation with contrastive loss for light-weight speaker recognition,” in ISCSLP, 2022, pp. 324–328.
  • [8] Xuechen Liu, Md Sahidullah, and Tomi Kinnunen, “Distilling multi-level x-vector knowledge for small-footprint speaker verification,” in arXiv preprint arXiv:2303.01125, 2023.
  • [9] Jungwoo Heo, Chan yeong Lim, Ju ho Kim, Hyun seo Shin, and Ha-** Yu, “One-Step Knowledge Distillation and Fine-Tuning in Using Large Pre-Trained Self-Supervised Learning Models for Speaker Verification,” in Proc. INTERSPEECH, 2023, pp. 5271–5275.
  • [10] Leying Zhang, Zhengyang Chen, and Yanmin Qian, “Knowledge distillation from multi-modality to single-modality for person verification,” in Proc. INTERSPEECH, 2021, pp. 1897–1901.
  • [11] Nik Vaessen and David van Leeuwen, “Training speaker recognition systems with limited data,” in Proc. INTERSPEECH, 2022, pp. 4760–4764.
  • [12] Yaobin Zhang and Weihong Deng, “Class-balanced training for deep face recognition,” in IEEE/CVF CVPRW, 2020, pp. 3594–3603.
  • [13] Borui Zhao, Quan Cui, Renjie Song, Yiyu Qiu, and Jiajun Liang, “Decoupled knowledge distillation,” in IEEE/CVF CVPR, 2022, pp. 11943–11952.
  • [14] Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, **yu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al., “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
  • [15] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” in IEEE ICASSP, 2018, pp. 5329–5333.
  • [16] Hossein Zeinali, Shuai Wang, Anna Silnova, Pavel Matějka, and Oldřich Plchot, “BUT system description to VoxCeleb speaker recognition challenge 2019,” in arXiv preprint arXiv:1910.12592, 2019.
  • [17] Hui Wang, Siqi Zheng, Yafeng Chen, Luyao Cheng, and Qian Chen, “CAM++: A Fast and Efficient Network for Speaker Verification Using Context-Aware Masking,” in Proc. INTERSPEECH, 2023, pp. 5301–5305.
  • [18] Arsha Nagrani, Joon Son Chung, Weidi Xie, and Andrew Zisserman, “VoxCeleb: Large-scale speaker verification in the wild,” Computer Speech & Language, vol. 60, pp. 101027, 2020.
  • [19] David Snyder, Guoguo Chen, and Daniel Povey, “MUSAN: A Music, Speech, and Noise Corpus,” 2015, arXiv:1510.08484v1.
  • [20] Tom Ko, Vijayaditya Peddinti, Daniel Povey, Michael L. Seltzer, and Sanjeev Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in IEEE ICASSP, 2017, pp. 5220–5224.
  • [21] Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck, “ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification,” in Proc. INTERSPEECH, 2020, pp. 3830–3834.
  • [22] Jiankang Deng et al., “ArcFace: Additive angular margin loss for deep face recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 10, pp. 5962–5979, Oct. 2022.
  • [23] Hongji Wang, Chengdong Liang, Shuai Wang, Zhengyang Chen, Binbin Zhang, Xu Xiang, Yanlei Deng, and Yanmin Qian, “Wespeaker: A research and production oriented speaker embedding learning toolkit,” in IEEE ICASSP, 2023, pp. 1–5.