
Showing 1–18 of 18 results for author: Stafylakis, T

Searching in archive eess.
  1. arXiv:2406.12622  [pdf, ps, other]

    eess.AS

    Challenging margin-based speaker embedding extractors by using the variational information bottleneck

    Authors: Themos Stafylakis, Anna Silnova, Johan Rohdin, Oldrich Plchot, Lukas Burget

    Abstract: Speaker embedding extractors are typically trained using a classification loss over the training speakers. During the last few years, the standard softmax/cross-entropy loss has been replaced by the margin-based losses, yielding significant improvements in speaker recognition accuracy. Motivated by the fact that the margin merely reduces the logit of the target speaker during training, we consider…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech 2024

  2. arXiv:2402.19325  [pdf, other]

    cs.SD eess.AS

    Do End-to-End Neural Diarization Attractors Need to Encode Speaker Characteristic Information?

    Authors: Lin Zhang, Themos Stafylakis, Federico Landini, Mireia Diez, Anna Silnova, Lukáš Burget

    Abstract: In this paper, we apply the variational information bottleneck approach to end-to-end neural diarization with encoder-decoder attractors (EEND-EDA). This allows us to investigate what information is essential for the model. EEND-EDA utilizes attractors, vector representations of speakers in a conversation. Our analysis shows that attractors do not necessarily have to contain speaker characteristi…

    Submitted 20 June, 2024; v1 submitted 29 February, 2024; originally announced February 2024.

    Comments: Accepted to Odyssey 2024. This arXiv version includes an appendix for more visualizations. Code: https://github.com/BUTSpeechFIT/EENDEDA_VIB

  3. arXiv:2312.04324  [pdf, other]

    eess.AS cs.SD

    DiaPer: End-to-End Neural Diarization with Perceiver-Based Attractors

    Authors: Federico Landini, Mireia Diez, Themos Stafylakis, Lukáš Burget

    Abstract: Until recently, the field of speaker diarization was dominated by cascaded systems. Due to their limitations, mainly regarding overlapped speech and cumbersome pipelines, end-to-end models have gained great popularity lately. One of the most successful models is end-to-end neural diarization with encoder-decoder based attractors (EEND-EDA). In this work, we replace the EDA module with a Perceiver-…

    Submitted 1 June, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

    Comments: Accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing

  4. arXiv:2305.10517  [pdf, other]

    eess.AS

    Improving Speaker Verification with Self-Pretrained Transformer Models

    Authors: Junyi Peng, Oldřich Plchot, Themos Stafylakis, Ladislav Mošner, Lukáš Burget, Jan Černocký

    Abstract: Recently, fine-tuning large pre-trained Transformer models using downstream datasets has received a rising interest. Despite their success, it is still challenging to disentangle the benefits of large-scale datasets and Transformer structures from the limitations of the pre-training. In this paper, we introduce a hierarchical training approach, named self-pretraining, in which Transformer models a…

    Submitted 17 May, 2023; originally announced May 2023.

    Comments: Accepted to Interspeech 2023

  5. arXiv:2211.01756  [pdf, other]

    eess.AS cs.SD

    Speech-based emotion recognition with self-supervised models using attentive channel-wise correlations and label smoothing

    Authors: Sofoklis Kakouros, Themos Stafylakis, Ladislav Mosner, Lukas Burget

    Abstract: When recognizing emotions from speech, we encounter two common problems: how to optimally capture emotion-relevant information from the speech signal and how to best quantify or categorize the noisy subjective emotion labels. Self-supervised pre-trained representations can robustly capture information from speech enabling state-of-the-art results in many downstream tasks including emotion recognit…

    Submitted 3 November, 2022; originally announced November 2022.

    Comments: Submitted to IEEE-ICASSP 2023

  6. arXiv:2210.16032  [pdf, other]

    eess.AS cs.SD eess.SP

    Parameter-efficient transfer learning of pre-trained Transformer models for speaker verification using adapters

    Authors: Junyi Peng, Themos Stafylakis, Rongzhi Gu, Oldřich Plchot, Ladislav Mošner, Lukáš Burget, Jan Černocký

    Abstract: Recently, the pre-trained Transformer models have received a rising interest in the field of speech processing thanks to their great success in various downstream tasks. However, most fine-tuning approaches update all the parameters of the pre-trained model, which becomes prohibitive as the model size grows and sometimes results in overfitting on small datasets. In this paper, we conduct a compreh…

    Submitted 28 October, 2022; originally announced October 2022.

    Comments: Submitted to ICASSP 2023

  7. arXiv:2210.09513  [pdf, other]

    eess.AS cs.SD

    Extracting speaker and emotion information from self-supervised speech models via channel-wise correlations

    Authors: Themos Stafylakis, Ladislav Mosner, Sofoklis Kakouros, Oldrich Plchot, Lukas Burget, Jan Cernocky

    Abstract: Self-supervised learning of speech representations from large amounts of unlabeled data has enabled state-of-the-art results in several speech processing tasks. Aggregating these speech representations across time is typically approached by using descriptive statistics, and in particular, using the first- and second-order statistics of representation coefficients. In this paper, we examine an alte…

    Submitted 15 October, 2022; originally announced October 2022.

    Comments: Accepted at IEEE-SLT 2022

  8. arXiv:2210.05291  [pdf, other]

    cs.CL cs.SD eess.AS

    On the Use of Semantically-Aligned Speech Representations for Spoken Language Understanding

    Authors: Gaëlle Laperrière, Valentin Pelloin, Mickaël Rouvier, Themos Stafylakis, Yannick Estève

    Abstract: In this paper we examine the use of semantically-aligned speech representations for end-to-end spoken language understanding (SLU). We employ the recently-introduced SAMU-XLSR model, which is designed to generate a single embedding that captures the semantics at the utterance level, semantically aligned across different languages. This model combines the acoustic frame-level speech representation…

    Submitted 11 October, 2022; originally announced October 2022.

    Comments: Accepted in IEEE SLT 2022. This work was performed using HPC resources from GENCI/IDRIS (grant 2022 AD011012565) and received funding from the EU H2020 research and innovation programme under the Marie Sklodowska-Curie ESPERANTO project (grant agreement No 101007666), through the SELMA project (grant No 957017) and from the French ANR through the AISSPER project (ANR-19-CE23-0004)

  9. arXiv:2210.01273  [pdf, other]

    eess.AS

    An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification

    Authors: Junyi Peng, Oldrich Plchot, Themos Stafylakis, Ladislav Mosner, Lukas Burget, Jan Cernocky

    Abstract: In recent years, the self-supervised learning paradigm has received extensive attention due to its great success in various downstream tasks. However, the fine-tuning strategies for adapting those pre-trained models to the speaker verification task have yet to be fully explored. In this paper, we analyze several feature extraction approaches built on top of a pre-trained model, as well as regularization…

    Submitted 3 October, 2022; originally announced October 2022.

    Comments: Accepted by SLT2022

  10. arXiv:2203.15436  [pdf, other]

    eess.AS

    Training Speaker Embedding Extractors Using Multi-Speaker Audio with Unknown Speaker Boundaries

    Authors: Themos Stafylakis, Ladislav Mošner, Oldřich Plchot, Johan Rohdin, Anna Silnova, Lukáš Burget, Jan "Honza" Černocký

    Abstract: In this paper, we demonstrate a method for training speaker embedding extractors using weak annotation. More specifically, we are using the full VoxCeleb recordings and the name of the celebrities appearing on each video without knowledge of the time intervals the celebrities appear in the video. We show that by combining a baseline speaker diarization algorithm that requires no training or parame…

    Submitted 9 August, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

    Comments: Accepted at Interspeech 2022

  11. arXiv:2203.10300  [pdf, other]

    eess.AS

    Analyzing speaker verification embedding extractors and back-ends under language and channel mismatch

    Authors: Anna Silnova, Themos Stafylakis, Ladislav Mosner, Oldrich Plchot, Johan Rohdin, Pavel Matejka, Lukas Burget, Ondrej Glembek, Niko Brummer

    Abstract: In this paper, we analyze the behavior and performance of speaker embeddings and the back-end scoring model under domain and language mismatch. We present our findings regarding ResNet-based speaker embedding architectures and show that reduced temporal stride yields improved performance. We then consider a PLDA back-end and show how a combination of small speaker subspace, language-dependent PLDA…

    Submitted 19 March, 2022; originally announced March 2022.

    Comments: Submitted to Odyssey 2022, under review

  12. arXiv:2104.02571  [pdf, ps, other]

    eess.AS cs.CV

    Speaker embeddings by modeling channel-wise correlations

    Authors: Themos Stafylakis, Johan Rohdin, Lukas Burget

    Abstract: Speaker embeddings extracted with deep 2D convolutional neural networks are typically modeled as projections of first and second order statistics of channel-frequency pairs onto a linear layer, using either average or attentive pooling along the time axis. In this paper we examine an alternative pooling method, where pairwise correlations between channels for given frequencies are used as statisti…

    Submitted 7 July, 2021; v1 submitted 6 April, 2021; originally announced April 2021.

    Comments: Accepted at Interspeech 2021

  13. arXiv:2009.01225  [pdf, other]

    cs.CV eess.AS

    Seeing wake words: Audio-visual Keyword Spotting

    Authors: Liliane Momeni, Triantafyllos Afouras, Themos Stafylakis, Samuel Albanie, Andrew Zisserman

    Abstract: The goal of this work is to automatically determine whether and when a word of interest is spoken by a talking face, with or without the audio. We propose a zero-shot method suitable for in-the-wild videos. Our key contributions are: (1) a novel convolutional architecture, KWS-Net, that uses a similarity map intermediate representation to separate the task into (i) sequence matching, and (ii) patt…

    Submitted 2 September, 2020; originally announced September 2020.

  14. arXiv:2004.04096  [pdf, ps, other]

    eess.AS cs.LG cs.SD stat.ML

    Probabilistic embeddings for speaker diarization

    Authors: Anna Silnova, Niko Brümmer, Johan Rohdin, Themos Stafylakis, Lukáš Burget

    Abstract: Speaker embeddings (x-vectors) extracted from very short segments of speech have recently been shown to give competitive performance in speaker diarization. We generalize this recipe by extracting from each speech segment, in parallel with the x-vector, also a diagonal precision matrix, thus providing a path for the propagation of information about the quality of the speech segment into a PLDA sco…

    Submitted 6 November, 2020; v1 submitted 6 April, 2020; originally announced April 2020.

    Comments: Awarded the Jack Godfrey Best Student Paper Award at Odyssey 2020: The Speaker and Language Recognition Workshop, Tokyo

  15. arXiv:1910.10599  [pdf, other]

    eess.AS

    End-to-end architectures for ASR-free spoken language understanding

    Authors: Elisavet Palogiannidi, Ioannis Gkinis, George Mastrapas, Petr Mizera, Themos Stafylakis

    Abstract: Spoken Language Understanding (SLU) is the problem of extracting the meaning from speech utterances. It is typically addressed as a two-step problem, where an Automatic Speech Recognition (ASR) model is employed to convert speech into text, followed by a Natural Language Understanding (NLU) model to extract meaning from the decoded text. Recently, end-to-end approaches have emerged, aiming at unif…

    Submitted 1 May, 2020; v1 submitted 23 October, 2019; originally announced October 2019.

    Comments: Accepted at ICASSP-2020

  16. arXiv:1907.03454  [pdf, other]

    eess.AS cs.CR

    Privacy-Preserving Speaker Recognition with Cohort Score Normalisation

    Authors: Andreas Nautsch, Jose Patino, Amos Treiber, Themos Stafylakis, Petr Mizera, Massimiliano Todisco, Thomas Schneider, Nicholas Evans

    Abstract: In many voice biometrics applications there is a requirement to preserve privacy, not least because of the recently enforced General Data Protection Regulation (GDPR). Though progress in bringing privacy preservation to voice biometrics is lagging behind developments in other biometrics communities, recent years have seen rapid progress, with secure computation mechanisms such as homomorphic encry…

    Submitted 8 July, 2019; originally announced July 2019.

    Journal ref: Proc. Interspeech 2019

  17. arXiv:1811.02331  [pdf, other]

    eess.AS cs.SD

    Speaker verification using end-to-end adversarial language adaptation

    Authors: Johan Rohdin, Themos Stafylakis, Anna Silnova, Hossein Zeinali, Lukas Burget, Oldrich Plchot

    Abstract: In this paper we investigate the use of adversarial domain adaptation for addressing the problem of language mismatch between speaker recognition corpora. In the context of speaker verification, adversarial domain adaptation methods aim at minimizing certain divergences between the distribution that the utterance-level features follow (i.e. speaker embeddings) when drawn from source and target dom…

    Submitted 6 November, 2018; originally announced November 2018.

  18. arXiv:1811.02066  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    How to Improve Your Speaker Embeddings Extractor in Generic Toolkits

    Authors: Hossein Zeinali, Lukas Burget, Johan Rohdin, Themos Stafylakis, Jan Cernocky

    Abstract: Recently, speaker embeddings extracted with deep neural networks became the state-of-the-art method for speaker verification. In this paper we aim to facilitate its implementation on a more generic toolkit than Kaldi, which we anticipate to enable further improvements on the method. We examine several tricks in training, such as the effects of normalizing input features and pooled statistics, diff…

    Submitted 5 November, 2018; originally announced November 2018.