Self-Distillation Network with Ensemble Prototypes: Learning Robust Speaker Representations without Supervision

Chen, Yafeng; Zheng, Siqi; Chen, Qian

arXiv:2308.02774v2 (eess)

[Submitted on 5 Aug 2023 (v1), revised 20 Aug 2023 (this version, v2), latest version 25 Jun 2024 (v5)]

Abstract: Training speaker-discriminative and robust speaker verification systems without speaker labels remains challenging and is worth exploring. Previous studies have noted a substantial performance gap between self-supervised and fully supervised approaches. In this paper, we propose an effective Self-Distillation network with Ensemble Prototypes (SDEP) to facilitate self-supervised speaker representation learning. Experiments on the VoxCeleb datasets demonstrate the superiority of the SDEP framework in speaker verification. SDEP achieves a new state of the art (SOTA) on the VoxCeleb1 speaker verification benchmark (equal error rates of 1.94%, 1.99%, and 3.77% on the Vox1-O, Vox1-E, and Vox1-H trials, respectively) without using any speaker labels during training. Code will be publicly available at this https URL.

Comments:	arXiv admin note: text overlap with arXiv:2211.04168
Subjects:	Audio and Speech Processing (eess.AS); Sound (cs.SD)
Cite as:	arXiv:2308.02774 [eess.AS]
	(or arXiv:2308.02774v2 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2308.02774

Submission history

From: Yafeng Chen
[v1] Sat, 5 Aug 2023 02:59:40 UTC (101 KB)
[v2] Sun, 20 Aug 2023 03:00:00 UTC (103 KB)
[v3] Tue, 12 Sep 2023 06:03:23 UTC (204 KB)
[v4] Thu, 27 Jun 2024 02:18:47 UTC (551 KB)
[v5] Tue, 25 Jun 2024 06:22:35 UTC (551 KB)
