Self-Distillation Prototypes Network: Learning Robust Speaker Representations without Supervision

Abstract

Training speaker-discriminative and robust speaker verification systems without explicit speaker labels remains a persistent challenge. In this paper, we propose a new self-supervised speaker verification approach, Self-Distillation Prototypes Network (SDPN), which effectively facilitates self-supervised speaker representation learning. SDPN assigns the representation of the augmented views of an utterance to the same prototypes as the representation of the original view, thereby enabling effective knowledge transfer between the views. Because the SDPN training process lacks negative pairs, the network tends to align positive pairs very closely in the embedding space, a phenomenon known as model collapse. To alleviate this problem, we introduce a diversity regularization term on the embeddings in SDPN. Comprehensive experiments on the VoxCeleb datasets demonstrate the superiority of SDPN in self-supervised speaker verification. SDPN sets a new state-of-the-art on the VoxCeleb1 speaker verification evaluation benchmark, achieving Equal Error Rates of 1.80%, 1.99%, and 3.62% on the VoxCeleb1-O, VoxCeleb1-E, and VoxCeleb1-H trials, respectively, without using any speaker labels in training. Ablation studies show that both the proposed learnable prototypes in the self-distillation network and the diversity regularization contribute to the verification performance. Code is publicly available at https://github.com/modelscope/3D-Speaker.

Index Terms—  Speaker verification, self-supervised learning, non-contrastive methods, model collapse, self-distillation prototypes network

1 Introduction

With the great success of deep learning, speaker verification (SV) systems have achieved remarkable progress in recent years. The availability of large-scale labeled datasets is critical for the deep learning performance. However, collecting large amounts of real-world labeled SV data is laborious and expensive. Therefore, it is of great interest to explore approaches mitigating the dependence on labeled data, such as self-supervised learning (SSL).

SSL methods learn representations of data without supervision (e.g., class labels). SSL methods can be roughly categorized into contrastive [1, 2, 3, 4, 5, 6] and non-contrastive approaches [7, 8, 9, 10, 11, 12, 13, 14], depending on whether the training process involves negative samples or not. Contrastive methods require large batch sizes or special techniques such as memory banks to attain high performance. More importantly, the quality of training is highly dependent on the correctness of negative pairs. For SSL based SV, since there are no speaker labels, a common practice is to construct positive pairs from the same utterance while constructing negative pairs from different utterances. Hence within a batch, negative pairs may come from the same speaker and would be incorrect.

In contrast, non-contrastive SSL methods do not have this issue because negative samples are not required in the training process, yet they have shown comparable or better performance compared to contrastive counterparts [7, 8, 9, 10, 11, 12, 13, 14]. Non-contrastive SSL methods can be broadly classified into single-stage [7, 8, 9, 10, 11, 12] and multi-stage training methods [13, 14]. Differences between single-stage and multi-stage training methods are detailed in [11].

Fig. 1: Overview of our proposed Self-Distillation Prototypes Network (SDPN). It comprises a teacher and a student network with identical architecture but different parameters. Teacher/student network consists of three main modules: the encoder for extracting speaker embeddings, multi-layer perceptron for feature transformation, and prototypes for computing soft-distributions between global and local views. EMA denotes Exponential Moving Average.

Note that the multi-stage training methods require an assumption of the rough number of speakers in the entire dataset and use a clustering algorithm to generate pseudo speaker labels for each utterance, which contradicts the original purpose of SSL. Therefore, in this work, we adopt the line of single-stage non-contrastive approaches for exploring self-supervised SV. Among prior works in this line, Sang et al. [7] propose a regularization structure inspired by BYOL [8]. In BYOL, an online network predicts a target network representation of the same utterance under different augmented views. Caron et al. [9] propose a self-distillation framework, DINO, that comprises a teacher and a student network. The outputs of the teacher network are used as ground truth to optimize the student network. Heo et al. [10] further introduce a curriculum learning strategy to DINO to guide model training. Due to the lack of negative pairs, these non-contrastive SSL methods are inclined to map positive pairs to the same or similar positions in the embedding space, resulting in degenerate solutions, a phenomenon known as model collapse. To alleviate model collapse, Chen et al. [11] propose two regularization terms in DINO. Chen et al. [12] enhance SV performance by applying augmentation strategies to DINO.

Although the application of DINO and BYOL for self-supervised speaker representation learning is promising [7, 12, 15], the key limitation of these non-contrastive methods lies in their emphasis on minimizing the distance between representations of the same utterance, while overlooking the relationship between representations of utterances from different speakers, which substantially limits their capability.

To tackle this limitation, we propose a novel non-contrastive Self-Distillation Prototypes Network (SDPN) for self-supervised speaker verification. In order to alleviate model collapse within the SDPN framework, we also introduce a regularization term to embeddings. Our contributions can be summarized as follows:

  • We propose a novel Self-Distillation Prototypes Network (SDPN) for self-supervised speaker verification. Our first key innovation is introducing learnable prototypes in a self-distillation framework to utilize unlabeled data to learn robust speaker-discriminative embeddings. The role of prototypes is similar to that of a classifier in fully supervised learning. SDPN explicitly minimizes the distance between different augmented views of the same utterance and, concurrently, implicitly differentiates the “class centers” within the prototypes (the vectors in the prototypes, which are similar to the class centers of a classifier in supervised training) by assigning representations of different utterances to their corresponding basis vectors in the prototypes. In this way, SDPN enhances clustering of embeddings of the same utterance while separating those from different speakers, effectively addressing the limitation of non-contrastive SSL.

  • Our second key innovation is proposing a diversity regularization term and integrating it within SDPN. This component computes the pairwise similarity among embeddings and actively segregates the nearest embeddings to promote the diversity of speaker embeddings within a batch. Diversity regularization further enhances the robustness of speaker embeddings.

  • Experiments on the VoxCeleb1 speaker verification evaluation benchmark demonstrate the superiority of our SDPN for self-supervised SV, substantially narrowing the performance gap between SSL and supervised methods for SV. SDPN sets a new state of the art (SOTA) on VoxCeleb1, achieving equal error rate 1.80%, 1.99%, and 3.62% for trial VoxCeleb1-O, VoxCeleb1-E and VoxCeleb1-H, without using any speaker labels in training. Ablation studies show that both proposed learnable prototypes in self-distillation network and diversity regularization contribute to the speaker verification performance.

2 Proposed Method

Fig. 1 illustrates the architecture of the proposed self-distillation framework SDPN. The upper branch in Fig. 1 presents the teacher network, while the lower branch shows the student network. Both networks share the same architecture with different parameters.

The teacher/student network comprises an encoder f for extracting speaker embeddings and a multi-layer perceptron (MLP) h for providing a non-linear transformation for these embeddings. We adopt ECAPA-TDNN [16] as the encoder. The MLP h comprises three fully connected layers with hidden dimensions of 2048-2048-256, followed by L2 normalization. The output of the first two layers passes through both batch normalization [17] and GELU activation functions [18]. The learnable prototypes c are shared between the teacher and student networks, and are used to compute the soft distributions between global and local views. Global views and local views refer to long and short segments randomly segmented from the same utterance respectively. Global views are fed into the teacher network and local views are fed into the student network. The global information learned by the teacher network guides the training of the student network, therefore enforcing the local-to-global correspondences.
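The head described above can be sketched in NumPy as follows. This is only an illustrative sketch, not the authors' implementation: random 512-dimensional vectors stand in for ECAPA-TDNN outputs, the order "batch normalization, then GELU" is assumed, batch normalization is used without learnable scale and shift, and the number of prototypes (`num_prototypes=1024`) is a hypothetical value chosen because it is consistent with the 0.26M prototype parameters reported later for 256-dimensional prototypes.

```python
import math
import numpy as np


def gelu(x):
    # Tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def batch_norm(x, eps=1e-5):
    # Training-mode batch normalization without learnable scale/shift (simplification).
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def init_head(rng, in_dim=512, hidden=2048, out_dim=256, num_prototypes=1024):
    # Three fully connected layers (2048-2048-256) plus the prototype matrix c.
    # num_prototypes is an assumption; the text only fixes the prototype dimension.
    dims = [in_dim, hidden, hidden, out_dim]
    weights = [rng.normal(0.0, 0.02, size=(a, b)) for a, b in zip(dims[:-1], dims[1:])]
    biases = [np.zeros(b) for b in dims[1:]]
    prototypes = rng.normal(0.0, 0.02, size=(num_prototypes, out_dim))
    return weights, biases, prototypes


def head_forward(emb, weights, biases):
    h = emb
    for i, (w, b) in enumerate(zip(weights, biases)):
        h = h @ w + b
        if i < 2:  # only the first two layers use BN + GELU
            h = gelu(batch_norm(h))
    # L2-normalize the 256-dimensional output.
    return h / np.linalg.norm(h, axis=1, keepdims=True)


rng = np.random.default_rng(0)
weights, biases, prototypes = init_head(rng)
emb = rng.normal(size=(8, 512))      # stand-in for a batch of speaker embeddings
z = head_forward(emb, weights, biases)
logits = z @ prototypes.T            # similarity of each view to every prototype
```

In a real system the teacher and student would each hold their own copy of `weights` and `biases`, while `prototypes` would be shared between them.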

Specifically, we adopt a multi-crop strategy to sample one global view $\textbf{X}_{g}=\{\textbf{x}_{g}\}$ and four local views $\textbf{X}_{l}=\{\textbf{x}_{l_{1}},\textbf{x}_{l_{2}},\textbf{x}_{l_{3}},\textbf{x}_{l_{4}}\}$ from an utterance. Data augmentation has been proven to be crucial for both supervised and self-supervised representation learning. In order to adequately capture the utterance-dependent variability in speaker embeddings, we explore two kinds of augmentation strategies, WavAugment [19] and SpecAugment [20], on the local views in SDPN, while the global view is left unchanged. Then, $\textbf{X}_{g}$ is first encoded by the teacher’s encoder $f^{tea}_{\vartheta}$ and the resulting representations are taken as speaker embeddings. Next, speaker embeddings are mapped through the teacher’s MLP $h^{tea}_{\vartheta}$. At the same time, the four local views $\textbf{X}_{l}$ are encoded by the student’s encoder $f^{stu}_{\theta}$ and then mapped through the student’s MLP $h^{stu}_{\theta}$. The parameters $\vartheta$ of the teacher encoder and MLP are updated via an Exponential Moving Average (EMA) of the parameters $\theta$ of the student encoder and MLP. To train the encoders, we compute distributions of similarity between the $d$-dimensional prototypes $\textbf{c}$ and each pair of local view and global view. The cross-entropy (CE) loss between the teacher and student distributions is minimized as follows:

$$L_{CE}=\sum_{\textbf{x}\in\textbf{X}_{g}}\sum_{\textbf{x}^{\prime}\in\textbf{X}_{l}}H\left(P^{tea}(\textbf{x})\mid P^{stu}(\textbf{x}^{\prime})\right) \tag{1}$$

$$P^{tea}(\textbf{x})=\mathrm{Sknorm}\left(h_{\vartheta}^{tea}(f_{\vartheta}^{tea}(\textbf{x}))\cdot\textbf{c}/\tau_{t}\right) \tag{2}$$

$$P^{stu}(\textbf{x}^{\prime})=\mathrm{Softmax}\left(h_{\theta}^{stu}(f_{\theta}^{stu}(\textbf{x}^{\prime}))\cdot\textbf{c}/\tau_{s}\right) \tag{3}$$

where $H(a \mid b)=-a\log b$ is cross-entropy. $P^{tea}$ and $P^{stu}$ denote the output probability distributions of the teacher network and the student network, where $\tau_{t}$ and $\tau_{s}$ are temperature parameters that control the sharpness of the teacher’s and student’s output distributions. $\mathrm{Sknorm}$ denotes Sinkhorn-Knopp (SK) batch normalization [21], which helps stabilize the teacher network. Softmax is applied to the student’s output. The temperature of the teacher is lower for sharper output.
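To make Eqs. (1)-(3) and the EMA update concrete, the following NumPy sketch computes the loss for one global view and four local views. It is a hedged illustration rather than the reference implementation: the Sinkhorn-Knopp routine follows the common three-iteration formulation, the loss is averaged over the batch, the toy sizes (32 prototypes of dimension 16) are chosen only for speed, and the EMA momentum of 0.996 is a hypothetical value that the text does not specify.

```python
import numpy as np


def sinkhorn_knopp(logits, n_iters=3):
    # logits: (batch, prototypes), already divided by the teacher temperature.
    q = np.exp(logits - logits.max()).T               # (K, B), shifted for stability
    q /= q.sum()
    n_protos, batch = q.shape
    for _ in range(n_iters):
        q /= q.sum(axis=1, keepdims=True) * n_protos  # equalize mass per prototype
        q /= q.sum(axis=0, keepdims=True) * batch     # equalize mass per sample
    return (q * batch).T                              # each row sums to 1


def log_softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def sdpn_ce_loss(z_teacher, z_students, prototypes, tau_t=0.04, tau_s=0.1):
    # Teacher targets come from the global view (Eq. 2); they are treated as constants.
    p_tea = sinkhorn_knopp(z_teacher @ prototypes.T / tau_t)
    loss = 0.0
    for z_s in z_students:                            # one term per local view (Eq. 1)
        log_p_stu = log_softmax(z_s @ prototypes.T / tau_s)   # log of Eq. 3
        loss += -(p_tea * log_p_stu).sum(axis=1).mean()
    return loss


def ema_update(teacher_params, student_params, momentum=0.996):
    # Teacher parameters follow an exponential moving average of the student's.
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


def unit(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)


rng = np.random.default_rng(0)
prototypes = unit(rng.normal(size=(32, 16)))          # toy prototypes
z_tea = unit(rng.normal(size=(8, 16)))                # teacher outputs for global views
z_stu = [unit(z_tea + 0.1 * rng.normal(size=(8, 16))) for _ in range(4)]
loss = sdpn_ce_loss(z_tea, z_stu, prototypes)
```

In an autodiff framework, gradients would flow only through the student branch and the shared prototypes, after which `ema_update` would refresh the teacher.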

Table 1: Results on VoxCeleb1-O, VoxCeleb1-E, and VoxCeleb1-H datasets. DINO* refers to our replication of the baseline DINO framework. SDPN w/o PRO denotes removing prototypes from our proposed SDPN. SDPN w/o PRO-2 denotes removing prototypes but increasing the number of parameters in the MLP layers, resulting in the same model size as SDPN. SDPN w/ DR denotes adding the diversity regularization to SDPN. The best results for each test set are in bold.

| Architecture | Params (M) | VoxCeleb1-O EER (%) ↓ | VoxCeleb1-O MinDCF ↓ | VoxCeleb1-E EER (%) ↓ | VoxCeleb1-E MinDCF ↓ | VoxCeleb1-H EER (%) ↓ | VoxCeleb1-H MinDCF ↓ |
|---|---|---|---|---|---|---|---|
| DINO* | 90.68 | 2.65 | 0.202 | 2.74 | 0.188 | 5.02 | 0.304 |
| SDPN | 57.24 | 2.13 | 0.170 | 2.12 | 0.150 | 4.15 | 0.263 |
| SDPN w/o PRO | 56.98 | 3.76 | 0.278 | 3.93 | 0.271 | 7.28 | 0.436 |
| SDPN w/o PRO-2 | 57.29 | 3.75 | 0.274 | 3.90 | 0.270 | 7.22 | 0.433 |
| SDPN w/ DR | 57.24 | **1.80** | **0.139** | **1.99** | **0.131** | **3.62** | **0.219** |

We further design a diversity regularization term to encourage the diversity of the embeddings within a batch. This regularizer forces the embeddings of utterances to be different and hence prevents trivial solutions, i.e., alleviating model collapse. First, we compute the pairwise similarity of speaker embeddings within a batch. Next, all the closest embeddings are separated in order to reinforce their dissimilarity. The diversity regularization loss is calculated as follows:

$$L_{DR}=-\frac{1}{n}\sum_{i=1}^{n}\left(\sum_{j=1}^{n}\log\left(\min_{j\neq i}\lVert\mathbf{x}_{i}-\mathbf{x}_{j}\rVert\right)\right) \tag{4}$$

Diversity regularization implicitly considers the relationship between different categories and further complements SDPN. The overall loss is a combination of the CE loss and the diversity regularization loss, weighted by the hyperparameter $\mu$:

$$L=L_{CE}+\mu L_{DR} \tag{5}$$
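Eqs. (4) and (5) can be illustrated with a short NumPy sketch. It rests on several assumptions: the inner sum over $j$ in Eq. (4), as printed, does not change the summand (the minimum already ranges over $j \neq i$) and therefore only scales the loss by a constant, so the sketch omits it; the norm is taken to be Euclidean; and a small `eps` is added inside the logarithm for numerical safety. The function names are hypothetical.

```python
import numpy as np


def diversity_regularization(emb, eps=1e-8):
    # emb: (n, d) speaker embeddings from one batch.
    diff = emb[:, None, :] - emb[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)      # pairwise Euclidean distances, (n, n)
    np.fill_diagonal(dist, np.inf)            # exclude j == i from the minimum
    nearest = dist.min(axis=1)                # distance to the closest other embedding
    # Minimizing this term pushes each embedding away from its nearest neighbour.
    return -np.mean(np.log(nearest + eps))


def total_loss(l_ce, l_dr, mu=0.1):
    # Eq. (5): weighted combination of the CE loss and the regularizer.
    return l_ce + mu * l_dr


rng = np.random.default_rng(0)
spread = rng.normal(size=(16, 8))             # well-separated embeddings
collapsed = 0.01 * spread                     # nearly collapsed embeddings
l_dr_spread = diversity_regularization(spread)
l_dr_collapsed = diversity_regularization(collapsed)
```

On this toy data the nearly collapsed batch receives the larger penalty, which is the behaviour the regularizer is meant to provide.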

3 Experiments

3.1 Experimental Setup

Datasets and Evaluation Metrics We evaluate the efficacy of the proposed approach on the commonly used VoxCeleb datasets. Specifically, we adopt the development portion of VoxCeleb2 [22] for training, which comprises 1,092,009 utterances across 5,994 speakers. No speaker labels are used in training in any experiment. We report the experimental results on three trials on VoxCeleb1 [23], in terms of two metrics, namely, the equal error rate (EER) and the minimum of the normalized detection cost function (MinDCF) with the settings of $P_{target}=0.05$ and $C_{fa}=C_{miss}=1$.
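For readers unfamiliar with the two metrics, a minimal NumPy sketch is given below. It assumes the standard definitions: EER is the operating point where the miss and false-alarm rates are equal, and MinDCF is the minimum over thresholds of $C_{miss}P_{miss}P_{target}+C_{fa}P_{fa}(1-P_{target})$, normalized by $\min(C_{miss}P_{target},\,C_{fa}(1-P_{target}))$. Tied scores and interpolation between thresholds are ignored, so the values are approximations; the function name and toy trial list are hypothetical.

```python
import numpy as np


def eer_and_min_dcf(scores, labels, p_target=0.05, c_miss=1.0, c_fa=1.0):
    # labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    sorted_labels = labels[np.argsort(scores)]
    n_tar = sorted_labels.sum()
    n_non = len(sorted_labels) - n_tar
    # Threshold k rejects the k lowest-scoring trials, for k = 0..N.
    fnr = np.concatenate([[0.0], np.cumsum(sorted_labels) / n_tar])
    fpr = np.concatenate([[1.0], 1.0 - np.cumsum(1 - sorted_labels) / n_non])
    idx = np.argmin(np.abs(fnr - fpr))
    eer = 0.5 * (fnr[idx] + fpr[idx])
    dcf = c_miss * fnr * p_target + c_fa * fpr * (1.0 - p_target)
    min_dcf = dcf.min() / min(c_miss * p_target, c_fa * (1.0 - p_target))
    return eer, min_dcf


# Toy trial list: cosine scores with their ground-truth labels.
scores = [0.82, 0.75, 0.40, 0.66, 0.31, 0.12, 0.58, 0.05]
labels = [1, 1, 1, 0, 0, 0, 1, 0]
eer, min_dcf = eer_and_min_dcf(scores, labels)
```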

Input Features For each utterance, we use the multi-crop strategy (Section 2) for SDPN training, in which 4s segments and 2s segments are taken as global views and local views, respectively. The acoustic features used in the experiments are 80-dimensional Filter Bank (FBank) features with 25ms windows and a 10ms shift. Speech Activity Detection (SAD) is not performed since the training data mostly consists of continuous speech.
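The multi-crop sampling can be sketched as below. This is an assumption-laden illustration: the 16 kHz sampling rate, the uniform choice of crop positions, and the handling of short utterances (repeating the waveform via `np.resize`) are not stated in the text, and `multi_crop` is a hypothetical helper name.

```python
import numpy as np


def multi_crop(wave, rng, sample_rate=16000, global_sec=4.0, local_sec=2.0, n_local=4):
    """Return one long (global) crop and several short (local) crops of a waveform."""

    def random_crop(num_samples):
        src = wave
        if len(src) < num_samples:
            # Assumption: utterances shorter than the crop are repeated to fill it.
            src = np.resize(src, num_samples)
        start = rng.integers(0, len(src) - num_samples + 1)
        return src[start:start + num_samples]

    global_view = random_crop(int(global_sec * sample_rate))
    local_views = [random_crop(int(local_sec * sample_rate)) for _ in range(n_local)]
    return global_view, local_views


rng = np.random.default_rng(0)
utterance = rng.normal(size=10 * 16000)       # stand-in for a 10 s waveform
global_view, local_views = multi_crop(utterance, rng)
```

The global view would then be passed to the teacher and the augmented local views to the student.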

Data Augmentation We explore WavAugment and SpecAugment in SDPN. For WavAugment, additive noise from the MUSAN corpus [24] at an SNR between 0 and 15, and reverberation from Room Impulse Responses (RIR) [25], are randomly applied to each local view. For SpecAugment, one time mask and one frequency mask are randomly applied to the FBank features of the local views. The time masking length is 0 to 10 frames and the frequency masking length is 0 to 6 dimensions.
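The SpecAugment setting described above (one time mask of 0 to 10 frames and one frequency mask of 0 to 6 bins) can be sketched as follows. The masked regions are filled with zeros and the mask positions are drawn uniformly; both choices are assumptions, as is the helper name `spec_augment`.

```python
import numpy as np


def spec_augment(fbank, rng, max_time_mask=10, max_freq_mask=6):
    # fbank: (frames, mel_bins) feature matrix of one local view.
    out = fbank.copy()
    n_frames, n_bins = out.shape

    # One time mask with a random length in [0, max_time_mask] frames.
    t_len = rng.integers(0, max_time_mask + 1)
    t_start = rng.integers(0, n_frames - t_len + 1)
    out[t_start:t_start + t_len, :] = 0.0

    # One frequency mask with a random length in [0, max_freq_mask] bins.
    f_len = rng.integers(0, max_freq_mask + 1)
    f_start = rng.integers(0, n_bins - f_len + 1)
    out[:, f_start:f_start + f_len] = 0.0
    return out


rng = np.random.default_rng(0)
fbank = np.ones((200, 80))                    # stand-in for 2 s of 80-dim FBank features
masked = spec_augment(fbank, rng)
```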

Implementation Details In our experiments, we use ECAPA-TDNN with attentive statistical pooling as the encoder. The weight decay is fixed at 5e-5. The learning rate schedule starts with 10 warm-up epochs with a linear increase from 0 to 0.4, followed by a cosine decay to a final learning rate of 1e-5. We train the model for 150 epochs using the stochastic gradient descent optimizer [26] with momentum 0.9, on 4 NVIDIA V100 GPUs. The temperatures $\tau_{t}$ and $\tau_{s}$ are set to 0.04 and 0.1, respectively. The hyperparameter $\mu$ is set to 0.1. The dimension of each prototype is set to 256. The dimension of speaker embeddings is 512.
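The learning rate schedule can be written as a small function. This sketch assumes that the 10 warm-up epochs are counted within the 150 training epochs and that the rate is evaluated per epoch (fractional epochs also work); a per-step schedule would be an equally plausible reading.

```python
import math


def learning_rate(epoch, total_epochs=150, warmup_epochs=10, peak_lr=0.4, final_lr=1e-5):
    if epoch < warmup_epochs:
        # Linear warm-up from 0 to the peak learning rate.
        return peak_lr * epoch / warmup_epochs
    # Cosine decay from the peak learning rate down to the final learning rate.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [learning_rate(e) for e in range(151)]
```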

Table 2: Comparison between our SDPN w/ DR results and results cited from recent SSL models on the VoxCeleb1-O test set.

| Model | Embedding Extractor | EER (%) |
|---|---|---|
| AP+AAT [27] | Fast ResNet34 | 8.65 |
| MoCo + WavAug [1] | TDNN | 8.23 |
| SSReg [7] | Fast ResNet34 | 6.99 |
| Mixup-Aug [6] | Fast ResNet34 | 5.84 |
| DINO + CL [10] | ECAPA-TDNN | 4.47 |
| DINO [15] | ECAPA-TDNN | 3.30 |
| RDINO [11] | ECAPA-TDNN | 3.24 |
| DINO-Aug [12] | ECAPA-TDNN | 2.51 |
| C3-DINO [15] | ECAPA-TDNN | 2.50 |
| SDPN w/ DR | ECAPA-TDNN | 1.80 |
Fig. 2: The t-SNE visualizations depict the extracted embeddings for five speakers, each denoted by a distinct color. The left figure displays the speaker embeddings derived from the baseline DINO, while the right figure shows those obtained from our SDPN with diversity regularization. The embeddings from SDPN with diversity regularization clearly exhibit enhanced separation compared to those from DINO, suggesting improved discriminability.

3.2 Results and Analysis

DINO is currently the most prevalent SSL framework for SV and the majority of competitive models, including the SOTA C3-DINO [15], are based on DINO (as shown in Table 2); hence Table 1 compares the results from our SDPN and our reproduced baseline DINO on the three VoxCeleb1-{O,E,H} test sets. (Our reproduced DINO achieved a better 2.65% EER on VoxCeleb1-O, while the original DINO paper [15] reported 3.30% EER on this test set.) Comparing rows 1 and 2 shows that SDPN outperforms DINO substantially and consistently across different test sets, achieving EERs of 2.13%, 2.12%, and 4.15% with a much smaller model size (57.24M), only 63% of the size of DINO (90.68M), which demonstrates the superiority of SDPN in self-supervised SV. Comparing rows 2 and 3 demonstrates that removing the prototypes causes a notable degradation in verification performance. We hypothesize that the learnable prototypes implicitly capture the relationship between different classes without relying on labeled data. Since the prototypes have 0.26M parameters, to understand how much of the performance degradation from row 2 to row 3 is due to the smaller model size, we evaluate a counterpart, SDPN w/o PRO-2, which compensates for the model size reduction from removing prototypes by increasing the number of parameters in the MLP layers by 0.31M, resulting in the same model size as SDPN. We find that SDPN w/o PRO-2 obtains an EER of 3.75% on VoxCeleb1-O, on par with the 3.76% EER from SDPN w/o PRO, yet worse than the 2.13% EER from SDPN. These results clearly demonstrate that the learnable prototypes contribute substantially to the verification performance.

Comparing rows 2 and 5 shows that adding diversity regularization (SDPN w/ DR) yields relative improvements in EERs on the three test sets by 15.49%, 6.13%, and 12.77%. Diversity regularization also consistently improves MinDCF. We employ the t-distributed Stochastic Neighbor Embedding (t-SNE) [28] to compare the disentanglement performance of speaker embeddings derived from both DINO and SDPN w/ DR, as illustrated in Fig. 2. It is clear that the embeddings extracted via SDPN w/ DR exhibit superior clustering capabilities compared to those from DINO, suggesting that SDPN makes speaker embeddings more discriminative.
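The relative improvements quoted above follow from the EERs in rows 2 and 5 of Table 1 as (baseline − improved) / baseline. A short check, with hypothetical variable names, is:

```python
def relative_improvement(baseline, improved):
    # Relative reduction in percent; lower EER is better.
    return 100.0 * (baseline - improved) / baseline


eer_sdpn = {"VoxCeleb1-O": 2.13, "VoxCeleb1-E": 2.12, "VoxCeleb1-H": 4.15}
eer_sdpn_dr = {"VoxCeleb1-O": 1.80, "VoxCeleb1-E": 1.99, "VoxCeleb1-H": 3.62}

improvements = {
    trial: round(relative_improvement(eer_sdpn[trial], eer_sdpn_dr[trial]), 2)
    for trial in eer_sdpn
}
for trial, value in improvements.items():
    print(f"{trial}: {value:.2f}% relative EER reduction")
```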

Table 3: The effect of the weight $\mu$ of diversity regularization (Eq. 5) on SDPN with diversity regularization.

| Weight $\mu$ | VoxCeleb1-O EER | VoxCeleb1-O MinDCF | VoxCeleb1-E EER | VoxCeleb1-E MinDCF | VoxCeleb1-H EER | VoxCeleb1-H MinDCF |
|---|---|---|---|---|---|---|
| 0 | 2.13% | 0.170 | 2.12% | 0.150 | 4.15% | 0.263 |
| 0.05 | 1.94% | 0.134 | 2.02% | 0.132 | 3.67% | 0.227 |
| 0.1 | 1.80% | 0.139 | 1.99% | 0.131 | 3.62% | 0.219 |
| 0.2 | 1.95% | 0.157 | 2.08% | 0.136 | 3.74% | 0.223 |

Table 4: The impact of data augmentation in SDPN with diversity regularization. Aug. denotes Augment and Wav+Spec. denotes the combination of WavAugment and SpecAugment.

| Augmentation | VoxCeleb1-O EER | VoxCeleb1-O MinDCF | VoxCeleb1-E EER | VoxCeleb1-E MinDCF | VoxCeleb1-H EER | VoxCeleb1-H MinDCF |
|---|---|---|---|---|---|---|
| No Aug. | 3.62% | 0.268 | 4.34% | 0.266 | 6.99% | 0.407 |
| WavAug. | 1.88% | 0.153 | 2.06% | 0.139 | 3.72% | 0.228 |
| SpecAug. | 4.27% | 0.317 | 5.03% | 0.309 | 7.69% | 0.447 |
| Wav+Spec. | 1.80% | 0.139 | 1.99% | 0.131 | 3.62% | 0.219 |

We compare our SDPN to recently proposed SSL approaches, including [27, 6, 1, 7, 10, 11, 12], and the current SSL SOTA C3-DINO [15], which integrates contrastive and non-contrastive methods, as shown in the first group in Table 2. On the VoxCeleb1-O test set, our non-contrastive SDPN with diversity regularization achieves 1.80% EER with a simple cosine distance scoring method, outperforming the SOTA C3-DINO (2.50% EER) by a 28.0% relative improvement.

We analyze the impact of the weight of diversity regularization on the performance, as shown in Table 3. We observe that applying diversity regularization outperforms DINO with even a small weight $\mu=0.05$. SDPN w/ DR achieves 1.80% EER with $\mu=0.1$.

Additionally, we study the effect of different data augmentation strategies on the training data, as shown in Table 4. We find that WavAugment notably improves SDPN performance; when no augmentation is applied in training, it is difficult for the entire network to converge due to the inherent property of a non-contrastive framework. Without WavAugment, SpecAugment degrades the performance since it applies an erasing operation to the acoustic features. Yet combining WavAugment and SpecAugment outperforms WavAugment alone, suggesting that these augmentation strategies are complementary.

4 Conclusion

We propose a novel self-distillation prototypes network (SDPN) with diversity regularization (DR) for self-supervised learning for speaker verification (SV), utilizing unlabeled data to learn robust speaker-discriminative embeddings. SDPN enhances clustering of embeddings from the same utterance while separating those from different speakers, addressing limitations of traditional non-contrastive methods. Diversity regularization alleviates the model collapse problem in non-contrastive frameworks. Comprehensive experiments demonstrate the superiority of SDPN, which further narrows the performance gap between SSL and supervised methods for speaker verification.

References

  • [1] Wei Xia, Chunlei Zhang, Chao Weng, Meng Yu, and Dong Yu, “Self-supervised text-independent speaker verification using prototypical momentum contrastive learning,” in ICASSP 2021. 2021, pp. 6723–6727, IEEE.
  • [2] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton, “A simple framework for contrastive learning of visual representations,” in ICML 2020. 2020, vol. 119 of Proceedings of Machine Learning Research, pp. 1597–1607, PMLR.
  • [3] Aäron van den Oord, Yazhe Li, and Oriol Vinyals, “Representation learning with contrastive predictive coding,” CoRR, vol. abs/1807.03748, 2018.
  • [4] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross B. Girshick, “Momentum contrast for unsupervised visual representation learning,” in CVPR 2020. 2020, pp. 9726–9735, Computer Vision Foundation / IEEE.
  • [5] Haoran Zhang, Yuexian Zou, and Helin Wang, “Contrastive self-supervised learning for text-independent speaker verification,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2021, Toronto, ON, Canada, June 6-11, 2021. 2021, pp. 6713–6717, IEEE.
  • [6] Xin Zhang, Minho Jin, Roger Cheng, Ruirui Li, Eunjung Han, and Andreas Stolcke, “Contrastive-mixup learning for improved speaker verification,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2022, Virtual and Singapore, 23-27 May 2022. 2022, pp. 7652–7656, IEEE.
  • [7] Mufan Sang, Haoqi Li, Fang Liu, Andrew O. Arnold, and Li Wan, “Self-supervised speaker verification with simple siamese network and self-supervised regularization,” in ICASSP 2022. 2022, pp. 6127–6131, IEEE.
  • [8] Jean-Bastien Grill et al., “Bootstrap your own latent - A new approach to self-supervised learning,” in NeurIPS 2020, 2020.
  • [9] Mathilde Caron et al., “Emerging properties in self-supervised vision transformers,” in ICCV 2021. 2021, pp. 9630–9640, IEEE.
  • [10] Hee-Soo Heo et al., “Self-supervised curriculum learning for speaker verification,” arXiv preprint arXiv:2203.14525, 2022.
  • [11] Yafeng Chen, Siqi Zheng, Hui Wang, Luyao Cheng, and Qian Chen, “Pushing the limits of self-supervised speaker verification using regularized distillation framework,” in ICASSP 2023. IEEE, 2023, pp. 1–5.
  • [12] Zhengyang Chen, Yao Qian, Bing Han, Yanmin Qian, and Michael Zeng, “A comprehensive study on self-supervised distillation for speaker representation learning,” in SLT 2022. 2022, pp. 599–604, IEEE.
  • [13] Bing Han, Zhengyang Chen, and Yanmin Qian, “Self-supervised learning with cluster-aware-dino for high-performance robust speaker verification,” CoRR, vol. abs/2304.05754, 2023.
  • [14] Ruijie Tao, Kong Aik Lee, Rohan Kumar Das, Ville Hautamäki, and Haizhou Li, “Self-supervised speaker recognition with loss-gated learning,” in ICASSP 2022. 2022, pp. 6142–6146, IEEE.
  • [15] Chunlei Zhang and Dong Yu, “C3-DINO: joint contrastive and non-contrastive self-supervised learning for speaker verification,” IEEE J. Sel. Top. Signal Process., vol. 16, no. 6, pp. 1273–1283, 2022.
  • [16] Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck, “ECAPA-TDNN: emphasized channel attention, propagation and aggregation in TDNN based speaker verification,” in Interspeech 2020. 2020, pp. 3830–3834, ISCA.
  • [17] Sergey Ioffe and Christian Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, Francis R. Bach and David M. Blei, Eds. 2015, vol. 37 of JMLR Workshop and Conference Proceedings, pp. 448–456, JMLR.org.
  • [18] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter, “Self-normalizing neural networks,” in NIPS 2017, 2017, pp. 971–980.
  • [19] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in ICASSP 2018. 2018, pp. 5329–5333, IEEE.
  • [20] Daniel S. Park et al., “Specaugment: A simple data augmentation method for automatic speech recognition,” in Interspeech 2019. 2019, pp. 2613–2617, ISCA.
  • [21] Marco Cuturi, “Sinkhorn distances: Lightspeed computation of optimal transport,” in NIPS 2013, Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger, Eds., 2013, pp. 2292–2300.
  • [22] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman, “Voxceleb2: Deep speaker recognition,” in Interspeech 2018. 2018, pp. 1086–1090, ISCA.
  • [23] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman, “Voxceleb: A large-scale speaker identification dataset,” in Interspeech 2017. 2017, pp. 2616–2620, ISCA.
  • [24] David Snyder, Guoguo Chen, and Daniel Povey, “MUSAN: A music, speech, and noise corpus,” CoRR, vol. abs/1510.08484, 2015.
  • [25] Tom Ko et al., “A study on data augmentation of reverberant speech for robust speech recognition,” in ICASSP 2017. 2017, pp. 5220–5224, IEEE.
  • [26] Shun-ichi Amari, “Backpropagation and stochastic gradient descent method,” Neurocomputing, vol. 5, no. 3, pp. 185–196, 1993.
  • [27] Jingu Kang, Jaesung Huh, Hee Soo Heo, and Joon Son Chung, “Augmentation adversarial training for self-supervised speaker representation learning,” IEEE J. Sel. Top. Signal Process., vol. 16, no. 6, pp. 1253–1262, 2022.
  • [28] Laurens Van der Maaten and Geoffrey Hinton, “Visualizing data using t-SNE,” Journal of machine learning research, vol. 9, no. 11, 2008.