Showing 1–23 of 23 results for author: Plchot, O

  1. arXiv:2406.12622  [pdf, ps, other]

    eess.AS

    Challenging margin-based speaker embedding extractors by using the variational information bottleneck

    Authors: Themos Stafylakis, Anna Silnova, Johan Rohdin, Oldrich Plchot, Lukas Burget

    Abstract: Speaker embedding extractors are typically trained using a classification loss over the training speakers. During the last few years, the standard softmax/cross-entropy loss has been replaced by the margin-based losses, yielding significant improvements in speaker recognition accuracy. Motivated by the fact that the margin merely reduces the logit of the target speaker during training, we consider…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech 2024

  2. arXiv:2402.13200  [pdf, other]

    eess.AS cs.SD

    Probing Self-supervised Learning Models with Target Speech Extraction

    Authors: Junyi Peng, Marc Delcroix, Tsubasa Ochiai, Oldrich Plchot, Takanori Ashihara, Shoko Araki, Jan Cernocky

    Abstract: Large-scale pre-trained self-supervised learning (SSL) models have shown remarkable advancements in speech-related tasks. However, the utilization of these models in complex multi-talker scenarios, such as extracting a target speaker in a mixture, is yet to be fully evaluated. In this paper, we introduce target speech extraction (TSE) as a novel downstream task to evaluate the feature extraction c…

    Submitted 17 February, 2024; originally announced February 2024.

    Comments: Accepted to ICASSP 2024, Self-supervision in Audio, Speech, and Beyond (SASB) workshop

  3. arXiv:2402.13199  [pdf, other]

    eess.AS cs.SD

    Target Speech Extraction with Pre-trained Self-supervised Learning Models

    Authors: Junyi Peng, Marc Delcroix, Tsubasa Ochiai, Oldrich Plchot, Shoko Araki, Jan Cernocky

    Abstract: Pre-trained self-supervised learning (SSL) models have achieved remarkable success in various speech tasks. However, their potential in target speech extraction (TSE) has not been fully exploited. TSE aims to extract the speech of a target speaker in a mixture guided by enrollment utterances. We exploit pre-trained SSL models for two purposes within a TSE framework, i.e., to process the input mixt…

    Submitted 17 February, 2024; originally announced February 2024.

    Comments: Accepted to ICASSP 2024

  4. arXiv:2305.10517  [pdf, other]

    eess.AS

    Improving Speaker Verification with Self-Pretrained Transformer Models

    Authors: Junyi Peng, Oldřich Plchot, Themos Stafylakis, Ladislav Mošner, Lukáš Burget, Jan Černocký

    Abstract: Recently, fine-tuning large pre-trained Transformer models using downstream datasets has received a rising interest. Despite their success, it is still challenging to disentangle the benefits of large-scale datasets and Transformer structures from the limitations of the pre-training. In this paper, we introduce a hierarchical training approach, named self-pretraining, in which Transformer models a…

    Submitted 17 May, 2023; originally announced May 2023.

    Comments: Accepted to Interspeech 2023

  5. arXiv:2210.16032  [pdf, other]

    eess.AS cs.SD eess.SP

    Parameter-efficient transfer learning of pre-trained Transformer models for speaker verification using adapters

    Authors: Junyi Peng, Themos Stafylakis, Rongzhi Gu, Oldřich Plchot, Ladislav Mošner, Lukáš Burget, Jan Černocký

    Abstract: Recently, the pre-trained Transformer models have received a rising interest in the field of speech processing thanks to their great success in various downstream tasks. However, most fine-tuning approaches update all the parameters of the pre-trained model, which becomes prohibitive as the model size grows and sometimes results in overfitting on small datasets. In this paper, we conduct a compreh…

    Submitted 28 October, 2022; originally announced October 2022.

    Comments: submitted to ICASSP2023

  6. arXiv:2210.09513  [pdf, other]

    eess.AS cs.SD

    Extracting speaker and emotion information from self-supervised speech models via channel-wise correlations

    Authors: Themos Stafylakis, Ladislav Mosner, Sofoklis Kakouros, Oldrich Plchot, Lukas Burget, Jan Cernocky

    Abstract: Self-supervised learning of speech representations from large amounts of unlabeled data has enabled state-of-the-art results in several speech processing tasks. Aggregating these speech representations across time is typically approached by using descriptive statistics, and in particular, using the first- and second-order statistics of representation coefficients. In this paper, we examine an alte…

    Submitted 15 October, 2022; originally announced October 2022.

    Comments: Accepted at IEEE-SLT 2022

  7. arXiv:2210.01273  [pdf, other]

    eess.AS

    An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification

    Authors: Junyi Peng, Oldrich Plchot, Themos Stafylakis, Ladislav Mosner, Lukas Burget, Jan Cernocky

    Abstract: In recent years, self-supervised learning paradigm has received extensive attention due to its great success in various down-stream tasks. However, the fine-tuning strategies for adapting those pre-trained models to speaker verification task have yet to be fully explored. In this paper, we analyze several feature extraction approaches built on top of a pre-trained model, as well as regularization…

    Submitted 3 October, 2022; originally announced October 2022.

    Comments: Accepted by SLT2022

  8. arXiv:2203.15436  [pdf, other]

    eess.AS

    Training Speaker Embedding Extractors Using Multi-Speaker Audio with Unknown Speaker Boundaries

    Authors: Themos Stafylakis, Ladislav Mošner, Oldřich Plchot, Johan Rohdin, Anna Silnova, Lukáš Burget, Jan "Honza" Černocký

    Abstract: In this paper, we demonstrate a method for training speaker embedding extractors using weak annotation. More specifically, we are using the full VoxCeleb recordings and the name of the celebrities appearing on each video without knowledge of the time intervals the celebrities appear in the video. We show that by combining a baseline speaker diarization algorithm that requires no training or parame…

    Submitted 9 August, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

    Comments: Accepted at Interspeech 2022

  9. arXiv:2203.14893  [pdf, ps, other]

    stat.ML cs.LG

    Probabilistic Spherical Discriminant Analysis: An Alternative to PLDA for length-normalized embeddings

    Authors: Niko Brümmer, Albert Swart, Ladislav Mošner, Anna Silnova, Oldřich Plchot, Themos Stafylakis, Lukáš Burget

    Abstract: In speaker recognition, where speech segments are mapped to embeddings on the unit hypersphere, two scoring backends are commonly used, namely cosine scoring or PLDA. Both have advantages and disadvantages, depending on the context. Cosine scoring follows naturally from the spherical geometry, but for PLDA the blessing is mixed -- length normalization Gaussianizes the between-speaker distribution,…

    Submitted 28 March, 2022; originally announced March 2022.

    Comments: Submitted to Interspeech 2022

  10. arXiv:2203.10300  [pdf, other]

    eess.AS

    Analyzing speaker verification embedding extractors and back-ends under language and channel mismatch

    Authors: Anna Silnova, Themos Stafylakis, Ladislav Mosner, Oldrich Plchot, Johan Rohdin, Pavel Matejka, Lukas Burget, Ondrej Glembek, Niko Brummer

    Abstract: In this paper, we analyze the behavior and performance of speaker embeddings and the back-end scoring model under domain and language mismatch. We present our findings regarding ResNet-based speaker embedding architectures and show that reduced temporal stride yields improved performance. We then consider a PLDA back-end and show how a combination of small speaker subspace, language-dependent PLDA…

    Submitted 19 March, 2022; originally announced March 2022.

    Comments: Submitted to Odyssey 2022, under review

  11. arXiv:2111.06458  [pdf, other]

    eess.AS cs.LG cs.SD

    MultiSV: Dataset for Far-Field Multi-Channel Speaker Verification

    Authors: Ladislav Mošner, Oldřich Plchot, Lukáš Burget, Jan Černocký

    Abstract: Motivated by unconsolidated data situation and the lack of a standard benchmark in the field, we complement our previous efforts and present a comprehensive corpus designed for training and evaluating text-independent multi-channel speaker verification systems. It can be readily used also for experiments with dereverberation, denoising, and speech enhancement. We tackled the ever-present problem o…

    Submitted 11 November, 2021; originally announced November 2021.

    Comments: Submitted to ICASSP 2022

  12. arXiv:2002.11356  [pdf, ps, other]

    eess.AS

    BUT System for the Second DIHARD Speech Diarization Challenge

    Authors: Federico Landini, Shuai Wang, Mireia Diez, Lukáš Burget, Pavel Matějka, Kateřina Žmolíková, Ladislav Mošner, Anna Silnova, Oldřich Plchot, Ondřej Novotný, Hossein Zeinali, Johan Rohdin

    Abstract: This paper describes the winning systems developed by the BUT team for the four tracks of the Second DIHARD Speech Diarization Challenge. For tracks 1 and 2 the systems were mainly based on performing agglomerative hierarchical clustering (AHC) of x-vectors, followed by another x-vector clustering based on Bayes hidden Markov model and variational Bayes inference. We provide a comparison of the im…

    Submitted 26 February, 2020; originally announced February 2020.

  13. arXiv:1910.12592  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    BUT System Description to VoxCeleb Speaker Recognition Challenge 2019

    Authors: Hossein Zeinali, Shuai Wang, Anna Silnova, Pavel Matějka, Oldřich Plchot

    Abstract: In this report, we describe the submission of Brno University of Technology (BUT) team to the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2019. We also provide a brief analysis of different systems on VoxCeleb-1 test sets. Submitted systems for both Fixed and Open conditions are a fusion of 4 Convolutional Neural Network (CNN) topologies. The first and second networks have ResNet34 topology an…

    Submitted 16 October, 2019; originally announced October 2019.

  14. arXiv:1910.08847  [pdf, ps, other]

    eess.AS

    BUT System Description for DIHARD Speech Diarization Challenge 2019

    Authors: Federico Landini, Shuai Wang, Mireia Diez, Lukáš Burget, Pavel Matějka, Kateřina Žmolíková, Ladislav Mošner, Oldřich Plchot, Ondřej Novotný, Hossein Zeinali, Johan Rohdin

    Abstract: This paper describes the systems developed by the BUT team for the four tracks of the second DIHARD speech diarization challenge. For tracks 1 and 2 the systems were based on performing agglomerative hierarchical clustering (AHC) over x-vectors, followed by the Bayesian Hidden Markov Model (HMM) with eigenvoice priors applied at x-vector level followed by the same approach applied at frame level.…

    Submitted 19 October, 2019; originally announced October 2019.

  15. Learning document embeddings along with their uncertainties

    Authors: Santosh Kesiraju, Oldřich Plchot, Lukáš Burget, Suryakanth V Gangashetty

    Abstract: Majority of the text modelling techniques yield only point-estimates of document embeddings and lack in capturing the uncertainty of the estimates. These uncertainties give a notion of how well the embeddings represent a document. We present Bayesian subspace multinomial model (Bayesian SMM), a generative log-linear model that learns to represent documents in the form of Gaussian distributions, th…

    Submitted 18 October, 2019; v1 submitted 20 August, 2019; originally announced August 2019.

  16. arXiv:1907.06112  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    BUT VOiCES 2019 System Description

    Authors: Hossein Zeinali, Pavel Matějka, Ladislav Mošner, Oldřich Plchot, Anna Silnova, Ondřej Novotný, Ján Profant, Ondřej Glembek, Lukáš Burget

    Abstract: This is a description of our effort in VOiCES 2019 Speaker Recognition challenge. All systems in the fixed condition are based on the x-vector paradigm with different features and DNN topologies. The single best system reaches 1.2% EER and a fusion of 3 systems yields 1.0% EER, which is 15% relative improvement. The open condition allowed us to use external data which we did for the PLDA adaptatio…

    Submitted 13 July, 2019; originally announced July 2019.

  17. arXiv:1904.04235  [pdf, other]

    eess.AS cs.SD

    Factorization of Discriminatively Trained i-vector Extractor for Speaker Recognition

    Authors: Ondrej Novotny, Oldrich Plchot, Ondrej Glembek, Lukas Burget

    Abstract: In this work, we continue in our research on i-vector extractor for speaker verification (SV) and we optimize its architecture for fast and effective discriminative training. We were motivated by computational and memory requirements caused by the large number of parameters of the original generative i-vector model. Our aim is to preserve the power of the original generative model, and at the same…

    Submitted 5 April, 2019; originally announced April 2019.

    Comments: Submitted to Interspeech 2019, Graz, Austria. arXiv admin note: substantial text overlap with arXiv:1810.13183

  18. arXiv:1904.03486  [pdf, other]

    cs.CV

    Self-supervised speaker embeddings

    Authors: Themos Stafylakis, Johan Rohdin, Oldrich Plchot, Petr Mizera, Lukas Burget

    Abstract: Contrary to i-vectors, speaker embeddings such as x-vectors are incapable of leveraging unlabelled utterances, due to the classification loss over training speakers. In this paper, we explore an alternative training strategy to enable the use of unlabelled utterances in training. We propose to train speaker embedding extractors via reconstructing the frames of a target speech segment, given the in…

    Submitted 23 April, 2019; v1 submitted 6 April, 2019; originally announced April 2019.

    Comments: Preprint. Submitted to Interspeech 2019. Updated results compared to first version and minor corrections

  19. arXiv:1811.07629  [pdf, other]

    eess.AS cs.SD

    Analysis of DNN Speech Signal Enhancement for Robust Speaker Recognition

    Authors: Ondrej Novotny, Oldrich Plchot, Ondrej Glembek, Jan "Honza" Cernocky, Lukas Burget

    Abstract: In this work, we present an analysis of a DNN-based autoencoder for speech enhancement, dereverberation and denoising. The target application is a robust speaker verification (SV) system. We start our approach by carefully designing a data augmentation process to cover wide range of acoustic conditions and obtain rich training data for various components of our SV system. We augment several well-k…

    Submitted 19 November, 2018; originally announced November 2018.

    Comments: 16 pages, 7 figures, Submission to Computer Speech and Language, special issue on Speaker and language characterization and recognition

  20. arXiv:1811.02938  [pdf, other]

    eess.AS cs.SD

    On the use of DNN Autoencoder for Robust Speaker Recognition

    Authors: Ondrej Novotny, Oldrich Plchot, Pavel Matejka, Ondrej Glembek

    Abstract: In this paper, we present an analysis of a DNN-based autoencoder for speech enhancement, dereverberation and denoising. The target application is a robust speaker recognition system. We started with augmenting the Fisher database with artificially noised and reverberated data and we trained the autoencoder to map noisy and reverberated speech to its clean version. We use the autoencoder as a prepr…

    Submitted 7 November, 2018; originally announced November 2018.

    Comments: 5 pages, 1 figure

  21. arXiv:1811.02331  [pdf, other]

    eess.AS cs.SD

    Speaker verification using end-to-end adversarial language adaptation

    Authors: Johan Rohdin, Themos Stafylakis, Anna Silnova, Hossein Zeinali, Lukas Burget, Oldrich Plchot

    Abstract: In this paper we investigate the use of adversarial domain adaptation for addressing the problem of language mismatch between speaker recognition corpora. In the context of speaker verification, adversarial domain adaptation methods aim at minimizing certain divergences between the distribution that the utterance-level features follow (i.e. speaker embeddings) when drawn from source and target dom…

    Submitted 6 November, 2018; originally announced November 2018.

  22. arXiv:1810.13183  [pdf, other]

    eess.AS cs.SD

    Discriminatively Re-trained i-vector Extractor for Speaker Recognition

    Authors: Ondrej Novotny, Oldrich Plchot, Ondrej Glembek, Lukas Burget, Pavel Matejka

    Abstract: In this work we revisit discriminative training of the i-vector extractor component in the standard speaker verification (SV) system. The motivation of our research lies in the robustness and stability of this large generative model, which we want to preserve, and focus its power towards any intended SV task. We show that after generative initialization of the i-vector extractor, we can further re…

    Submitted 31 October, 2018; originally announced October 2018.

    Comments: 5 pages, 1 figure, submitted to ICASSP 2019

  23. arXiv:1710.02369  [pdf, other]

    eess.AS cs.SD

    End-to-end DNN Based Speaker Recognition Inspired by i-vector and PLDA

    Authors: Johan Rohdin, Anna Silnova, Mireia Diez, Oldrich Plchot, Pavel Matejka, Lukas Burget

    Abstract: Recently several end-to-end speaker verification systems based on deep neural networks (DNNs) have been proposed. These systems have been proven to be competitive for text-dependent tasks as well as for text-independent tasks with short utterances. However, for text-independent tasks with longer utterances, end-to-end systems are still outperformed by standard i-vector + PLDA systems. In this work…

    Submitted 8 January, 2018; v1 submitted 6 October, 2017; originally announced October 2017.