
Showing 1–38 of 38 results for author: Cernocky, J

  1. arXiv:2403.07767  [pdf, ps, other]

    eess.AS cs.LG eess.SP

    Beyond the Labels: Unveiling Text-Dependency in Paralinguistic Speech Recognition Datasets

    Authors: Jan Pešán, Santosh Kesiraju, Lukáš Burget, Jan "Honza" Černocký

    Abstract: Paralinguistic traits like cognitive load and emotion are increasingly recognized as pivotal areas in speech recognition research, often examined through specialized datasets like CLSE and IEMOCAP. However, the integrity of these datasets is seldom scrutinized for text-dependency. This paper critically evaluates the prevalent assumption that machine learning models trained on such datasets genuine…

    Submitted 12 March, 2024; originally announced March 2024.

  2. arXiv:2402.13200  [pdf, other]

    eess.AS cs.SD

    Probing Self-supervised Learning Models with Target Speech Extraction

    Authors: Junyi Peng, Marc Delcroix, Tsubasa Ochiai, Oldrich Plchot, Takanori Ashihara, Shoko Araki, Jan Cernocky

    Abstract: Large-scale pre-trained self-supervised learning (SSL) models have shown remarkable advancements in speech-related tasks. However, the utilization of these models in complex multi-talker scenarios, such as extracting a target speaker in a mixture, is yet to be fully evaluated. In this paper, we introduce target speech extraction (TSE) as a novel downstream task to evaluate the feature extraction c…

    Submitted 17 February, 2024; originally announced February 2024.

    Comments: Accepted to ICASSP 2024, Self-supervision in Audio, Speech, and Beyond (SASB) workshop

  3. arXiv:2402.13199  [pdf, other]

    eess.AS cs.SD

    Target Speech Extraction with Pre-trained Self-supervised Learning Models

    Authors: Junyi Peng, Marc Delcroix, Tsubasa Ochiai, Oldrich Plchot, Shoko Araki, Jan Cernocky

    Abstract: Pre-trained self-supervised learning (SSL) models have achieved remarkable success in various speech tasks. However, their potential in target speech extraction (TSE) has not been fully exploited. TSE aims to extract the speech of a target speaker in a mixture guided by enrollment utterances. We exploit pre-trained SSL models for two purposes within a TSE framework, i.e., to process the input mixt…

    Submitted 17 February, 2024; originally announced February 2024.

    Comments: Accepted to ICASSP 2024

  4. arXiv:2309.08377  [pdf, other]

    eess.AS cs.CL cs.SD

    DiaCorrect: Error Correction Back-end For Speaker Diarization

    Authors: Jiangyu Han, Federico Landini, Johan Rohdin, Mireia Diez, Lukas Burget, Yuhang Cao, Heng Lu, Jan Cernocky

    Abstract: In this work, we propose an error correction framework, named DiaCorrect, to refine the output of a diarization system in a simple yet effective way. This method is inspired by error correction techniques in automatic speech recognition. Our model consists of two parallel convolutional encoders and a transform-based decoder. By exploiting the interactions between the input recording and the initia…

    Submitted 15 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 2024

  5. arXiv:2308.08027  [pdf, other]

    eess.AS cs.CL cs.SD

    End-to-End Open Vocabulary Keyword Search With Multilingual Neural Representations

    Authors: Bolaji Yusuf, Jan Cernocky, Murat Saraclar

    Abstract: Conventional keyword search systems operate on automatic speech recognition (ASR) outputs, which causes them to have a complex indexing and search pipeline. This has led to interest in ASR-free approaches to simplify the search procedure. We recently proposed a neural ASR-free keyword search model which achieves competitive performance while maintaining an efficient and simplified pipeline, where…

    Submitted 15 August, 2023; originally announced August 2023.

    Comments: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 2023

    Journal ref: in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 3070-3080, 2023

  6. arXiv:2305.10517  [pdf, other]

    eess.AS

    Improving Speaker Verification with Self-Pretrained Transformer Models

    Authors: Junyi Peng, Oldřich Plchot, Themos Stafylakis, Ladislav Mošner, Lukáš Burget, Jan Černocký

    Abstract: Recently, fine-tuning large pre-trained Transformer models using downstream datasets has received rising interest. Despite their success, it is still challenging to disentangle the benefits of large-scale datasets and Transformer structures from the limitations of the pre-training. In this paper, we introduce a hierarchical training approach, named self-pretraining, in which Transformer models a…

    Submitted 17 May, 2023; originally announced May 2023.

    Comments: Accepted to Interspeech 2023

  7. Neural Target Speech Extraction: An Overview

    Authors: Katerina Zmolikova, Marc Delcroix, Tsubasa Ochiai, Keisuke Kinoshita, Jan Černocký, Dong Yu

    Abstract: Humans can listen to a target speaker even in challenging acoustic conditions that have noise, reverberation, and interfering speakers. This phenomenon is known as the cocktail-party effect. For decades, researchers have focused on approaching the listening ability of humans. One critical issue is handling interfering speakers because the target and non-target speech signals share similar characte…

    Submitted 30 January, 2023; originally announced January 2023.

    Comments: Submitted to IEEE Signal Processing Magazine on Apr. 25, 2022, and accepted on Jan. 12, 2023

  8. arXiv:2211.04054  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications

    Authors: Juan Zuluaga-Gomez, Karel Veselý, Igor Szöke, Alexander Blatt, Petr Motlicek, Martin Kocour, Mickael Rigault, Khalid Choukri, Amrutha Prasad, Seyyed Saeed Sarfjoo, Iuliia Nigmatulina, Claudia Cevenini, Pavel Kolčárek, Allan Tart, Jan Černocký, Dietrich Klakow

    Abstract: Personal assistants, automatic speech recognizers and dialogue understanding systems are becoming more critical in our interconnected digital world. A clear example is air traffic control (ATC) communications. ATC aims at guiding aircraft and controlling the airspace in a safe and optimal manner. These voice-based dialogues are carried between an air traffic controller (ATCO) and pilots via very-h…

    Submitted 15 June, 2023; v1 submitted 8 November, 2022; originally announced November 2022.

    Comments: Manuscript under review; The code is available at: https://github.com/idiap/atco2-corpus

  9. arXiv:2210.16032  [pdf, other]

    eess.AS cs.SD eess.SP

    Parameter-efficient transfer learning of pre-trained Transformer models for speaker verification using adapters

    Authors: Junyi Peng, Themos Stafylakis, Rongzhi Gu, Oldřich Plchot, Ladislav Mošner, Lukáš Burget, Jan Černocký

    Abstract: Recently, pre-trained Transformer models have received rising interest in the field of speech processing thanks to their great success in various downstream tasks. However, most fine-tuning approaches update all the parameters of the pre-trained model, which becomes prohibitive as the model size grows and sometimes results in overfitting on small datasets. In this paper, we conduct a compreh…

    Submitted 28 October, 2022; originally announced October 2022.

    Comments: submitted to ICASSP2023

  10. arXiv:2210.09513  [pdf, other]

    eess.AS cs.SD

    Extracting speaker and emotion information from self-supervised speech models via channel-wise correlations

    Authors: Themos Stafylakis, Ladislav Mosner, Sofoklis Kakouros, Oldrich Plchot, Lukas Burget, Jan Cernocky

    Abstract: Self-supervised learning of speech representations from large amounts of unlabeled data has enabled state-of-the-art results in several speech processing tasks. Aggregating these speech representations across time is typically approached by using descriptive statistics, and in particular, using the first- and second-order statistics of representation coefficients. In this paper, we examine an alte…

    Submitted 15 October, 2022; originally announced October 2022.

    Comments: Accepted at IEEE-SLT 2022

  11. arXiv:2210.01273  [pdf, other]

    eess.AS

    An attention-based backend allowing efficient fine-tuning of transformer models for speaker verification

    Authors: Junyi Peng, Oldrich Plchot, Themos Stafylakis, Ladislav Mosner, Lukas Burget, Jan Cernocky

    Abstract: In recent years, the self-supervised learning paradigm has received extensive attention due to its great success in various downstream tasks. However, the fine-tuning strategies for adapting those pre-trained models to the speaker verification task have yet to be fully explored. In this paper, we analyze several feature extraction approaches built on top of a pre-trained model, as well as regularization…

    Submitted 3 October, 2022; originally announced October 2022.

    Comments: Accepted by SLT2022

  12. arXiv:2208.07091  [pdf, other]

    cs.SD cs.LG eess.AS

    Analysis of impact of emotions on target speech extraction and speech separation

    Authors: Ján Švec, Kateřina Žmolíková, Martin Kocour, Marc Delcroix, Tsubasa Ochiai, Ladislav Mošner, Jan Černocký

    Abstract: Recently, the performance of blind speech separation (BSS) and target speech extraction (TSE) has greatly progressed. Most works, however, focus on relatively well-controlled conditions using, e.g., read speech. The performance may degrade in more realistic situations. One of the factors causing such degradation may be intrinsic speaker variability, such as emotions, occurring commonly in realisti…

    Submitted 15 August, 2022; originally announced August 2022.

    Comments: Accepted to IWAENC 2022

  13. arXiv:2204.00770  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Speaker adaptation for Wav2vec2 based dysarthric ASR

    Authors: Murali Karthick Baskar, Tim Herzig, Diana Nguyen, Mireia Diez, Tim Polzehl, Lukáš Burget, Jan "Honza" Černocký

    Abstract: Dysarthric speech recognition has posed major challenges due to lack of training data and heavy mismatch in speaker characteristics. Recent ASR systems have benefited from readily available pretrained models such as wav2vec2 to improve the recognition performance. Speaker adaptation using fMLLR and xvectors has provided major gains for dysarthric speech with very little adaptation data. However,…

    Submitted 2 April, 2022; originally announced April 2022.

    Comments: Submitted to INTERSPEECH 2022

  14. arXiv:2203.15436  [pdf, other]

    eess.AS

    Training Speaker Embedding Extractors Using Multi-Speaker Audio with Unknown Speaker Boundaries

    Authors: Themos Stafylakis, Ladislav Mošner, Oldřich Plchot, Johan Rohdin, Anna Silnova, Lukáš Burget, Jan "Honza" Černocký

    Abstract: In this paper, we demonstrate a method for training speaker embedding extractors using weak annotation. More specifically, we are using the full VoxCeleb recordings and the names of the celebrities appearing in each video, without knowledge of the time intervals in which the celebrities appear in the video. We show that by combining a baseline speaker diarization algorithm that requires no training or parame…

    Submitted 9 August, 2022; v1 submitted 29 March, 2022; originally announced March 2022.

    Comments: Accepted at Interspeech 2022

  15. arXiv:2112.13520  [pdf, other]

    eess.AS

    DPCCN: Densely-Connected Pyramid Complex Convolutional Network for Robust Speech Separation And Extraction

    Authors: Jiangyu Han, Yanhua Long, Lukas Burget, Jan Cernocky

    Abstract: In recent years, a number of time-domain speech separation methods have been proposed. However, most of them are very sensitive to the environments and wide domain coverage tasks. In this paper, from the time-frequency domain perspective, we propose a densely-connected pyramid complex convolutional network, termed DPCCN, to improve the robustness of speech separation under complicated conditions.…

    Submitted 29 January, 2022; v1 submitted 27 December, 2021; originally announced December 2021.

    Comments: accepted by ICASSP 2022

  16. arXiv:2111.06458  [pdf, other]

    eess.AS cs.LG cs.SD

    MultiSV: Dataset for Far-Field Multi-Channel Speaker Verification

    Authors: Ladislav Mošner, Oldřich Plchot, Lukáš Burget, Jan Černocký

    Abstract: Motivated by the unconsolidated data situation and the lack of a standard benchmark in the field, we complement our previous efforts and present a comprehensive corpus designed for training and evaluating text-independent multi-channel speaker verification systems. It can also be readily used for experiments with dereverberation, denoising, and speech enhancement. We tackled the ever-present problem o…

    Submitted 11 November, 2021; originally announced November 2021.

    Comments: Submitted to ICASSP 2022

  17. arXiv:2111.00009  [pdf, other]

    eess.AS cs.LG cs.SD

    Revisiting joint decoding based multi-talker speech recognition with DNN acoustic model

    Authors: Martin Kocour, Kateřina Žmolíková, Lucas Ondel, Ján Švec, Marc Delcroix, Tsubasa Ochiai, Lukáš Burget, Jan Černocký

    Abstract: In typical multi-talker speech recognition systems, a neural network-based acoustic model predicts senone state posteriors for each speaker. These are later used by a single-talker decoder which is applied on each speaker-specific output stream separately. In this work, we argue that such a scheme is sub-optimal and propose a principled solution that decodes all speakers jointly. We modify the aco…

    Submitted 15 April, 2022; v1 submitted 31 October, 2021; originally announced November 2021.

    Comments: submitted to Interspeech 2022

  18. arXiv:2104.07474  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD

    EAT: Enhanced ASR-TTS for Self-supervised Speech Recognition

    Authors: Murali Karthick Baskar, Lukáš Burget, Shinji Watanabe, Ramon Fernandez Astudillo, Jan "Honza" Černocký

    Abstract: Self-supervised ASR-TTS models suffer in out-of-domain data conditions. Here we propose an enhanced ASR-TTS (EAT) model that incorporates two main features: 1) The ASR→TTS direction is equipped with a language model reward to penalize the ASR hypotheses before forwarding them to TTS. 2) In the TTS→ASR direction, a hyper-parameter is introduced to scale the attention context f…

    Submitted 13 April, 2021; originally announced April 2021.

  19. arXiv:2104.02332  [pdf, other]

    eess.AS

    Detecting English Speech in the Air Traffic Control Voice Communication

    Authors: Igor Szoke, Santosh Kesiraju, Ondrej Novotny, Martin Kocour, Karel Vesely, Jan "Honza" Cernocky

    Abstract: We launched a community platform for collecting ATC speech worldwide in the ATCO2 project. Filtering out unseen non-English speech is one of the main components in the data processing pipeline. The proposed English Language Detection (ELD) system is based on the embeddings from a Bayesian subspace multinomial model. It is trained on the word confusion network from an ASR system. It is robust, e…

    Submitted 6 April, 2021; originally announced April 2021.

  20. arXiv:2101.12729  [pdf, other]

    eess.AS cs.CL

    BCN2BRNO: ASR System Fusion for Albayzin 2020 Speech to Text Challenge

    Authors: Martin Kocour, Guillermo Cámbara, Jordi Luque, David Bonet, Mireia Farrús, Martin Karafiát, Karel Veselý, Jan "Honza" Černocký

    Abstract: This paper describes the joint effort of BUT and Telefónica Research on the development of Automatic Speech Recognition systems for the Albayzin 2020 Challenge. We compare approaches based on either hybrid or end-to-end models. In hybrid modelling, we explore the impact of a SpecAugment layer on performance. For end-to-end modelling, we used a convolutional neural network with gated linear units (GLUs). The per…

    Submitted 29 January, 2021; originally announced January 2021.

    Comments: fusion, end-to-end model, hybrid model, semisupervised, automatic speech recognition, convolutional neural network

  21. arXiv:2011.11984  [pdf, other]

    eess.AS

    Integration of variational autoencoder and spatial clustering for adaptive multi-channel neural speech separation

    Authors: Katerina Zmolikova, Marc Delcroix, Lukáš Burget, Tomohiro Nakatani, Jan "Honza" Černocký

    Abstract: In this paper, we propose a method combining a variational autoencoder model of speech with a spatial clustering approach for multi-channel speech separation. The advantage of integrating spatial clustering with a spectral model was shown in several works. As the spectral model, previous works used either factorial generative models of the mixed speech or discriminative neural networks. In our work,…

    Submitted 24 November, 2020; originally announced November 2020.

    Comments: 8 pages, 3 figures, to be published in SLT2021

  22. arXiv:2011.03115  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    A Hierarchical Subspace Model for Language-Attuned Acoustic Unit Discovery

    Authors: Bolaji Yusuf, Lucas Ondel, Lukas Burget, Jan Cernocky, Murat Saraclar

    Abstract: In this work, we propose a hierarchical subspace model for acoustic unit discovery. In this approach, we frame the task as one of learning embeddings on a low-dimensional phonetic subspace, and simultaneously specify the subspace itself as an embedding on a hyper-subspace. We train the hyper-subspace on a set of transcribed languages and transfer it to the target language. In the target language,…

    Submitted 9 November, 2020; v1 submitted 4 November, 2020; originally announced November 2020.

    Comments: Submitted to ICASSP 2021

  23. arXiv:2010.11593  [pdf, other]

    cs.CL cs.AI

    A Technical Report: BUT Speech Translation Systems

    Authors: Hari Krishna Vydana, Lukas Burget, Jan Cernocky

    Abstract: The paper describes BUT's speech translation systems. The systems are English→German offline speech translation systems. The systems are based on our previous works \cite{Jointly_trained_transformers}. Though End-to-End and cascade (ASR-MT) spoken language translation (SLT) systems are reaching comparable performances, a large degradation is observed when translating ASR hypoth…

    Submitted 22 October, 2020; originally announced October 2020.

  24. arXiv:2007.01359  [pdf, ps, other]

    cs.CL

    A Bayesian Multilingual Document Model for Zero-shot Topic Identification and Discovery

    Authors: Santosh Kesiraju, Sangeet Sagar, Ondřej Glembek, Lukáš Burget, Ján Černocký, Suryakanth V Gangashetty

    Abstract: In this paper, we present a Bayesian multilingual document model for learning language-independent document embeddings. The model is an extension of BaySMM [Kesiraju et al 2020] to the multilingual scenario. It learns to represent the document embeddings in the form of Gaussian distributions, thereby encoding the uncertainty in its covariance. We propagate the learned uncertainties through linear…

    Submitted 23 March, 2024; v1 submitted 2 July, 2020; originally announced July 2020.

  25. arXiv:2001.11360  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    BUT Opensat 2019 Speech Recognition System

    Authors: Martin Karafiát, Murali Karthick Baskar, Igor Szöke, Hari Krishna Vydana, Karel Veselý, Jan "Honza" Černocký

    Abstract: The paper describes the BUT Automatic Speech Recognition (ASR) systems submitted for OpenSAT evaluations under two domain categories: low-resourced languages and public safety communications. The first was challenging due to lack of training data; therefore, various architectures and multilingual approaches were employed. The combination led to superior performance. The second domain was cha…

    Submitted 30 January, 2020; originally announced January 2020.

    Comments: REJECTED in ICASSP 2020

  26. arXiv:1912.03627  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    A Multi Purpose and Large Scale Speech Corpus in Persian and English for Speaker and Speech Recognition: the DeepMine Database

    Authors: Hossein Zeinali, Lukáš Burget, Jan "Honza" Černocký

    Abstract: DeepMine is a speech database in Persian and English designed to build and evaluate text-dependent, text-prompted, and text-independent speaker verification, as well as Persian speech recognition systems. It contains more than 1850 speakers and 540 thousand recordings overall; more than 480 hours of speech are transcribed. It is the first public large-scale speaker verification database in Persian…

    Submitted 8 December, 2019; originally announced December 2019.

  27. arXiv:1907.12908  [pdf, ps, other]

    cs.CV cs.AI cs.CR

    Detecting Spoofing Attacks Using VGG and SincNet: BUT-Omilia Submission to ASVspoof 2019 Challenge

    Authors: Hossein Zeinali, Themos Stafylakis, Georgia Athanasopoulou, Johan Rohdin, Ioannis Gkinis, Lukáš Burget, Jan "Honza" Černocký

    Abstract: In this paper, we present the system description of the joint efforts of Brno University of Technology (BUT) and Omilia -- Conversational Intelligence for the ASVSpoof2019 Spoofing and Countermeasures Challenge. The primary submission for Physical access (PA) is a fusion of two VGG networks, trained on single- and two-channel features. For Logical access (LA), our primary system is a fusion of VGG…

    Submitted 13 July, 2019; originally announced July 2019.

  28. arXiv:1907.07127  [pdf, ps, other]

    eess.AS cs.SD

    Acoustic Scene Classification Using Fusion of Attentive Convolutional Neural Networks for DCASE2019 Challenge

    Authors: Hossein Zeinali, Lukáš Burget, Jan "Honza" Černocký

    Abstract: In this report, the Brno University of Technology (BUT) team submissions for Task 1 (Acoustic Scene Classification, ASC) of the DCASE-2019 challenge are described. Also, the analysis of different methods is provided. The proposed approach is a fusion of three different Convolutional Neural Network (CNN) topologies. The first one is a VGG-like two-dimensional CNN. The second one is again a two-dim…

    Submitted 13 July, 2019; originally announced July 2019.

    Comments: arXiv admin note: text overlap with arXiv:1810.04273

  29. arXiv:1905.01152  [pdf, ps, other]

    eess.AS cs.CL cs.IR cs.LG cs.SD

    Semi-supervised Sequence-to-sequence ASR using Unpaired Speech and Text

    Authors: Murali Karthick Baskar, Shinji Watanabe, Ramon Astudillo, Takaaki Hori, Lukáš Burget, Jan Černocký

    Abstract: Sequence-to-sequence automatic speech recognition (ASR) models require large quantities of data to attain high performance. For this reason, there has been a recent surge in interest for unsupervised and semi-supervised training in such models. This work builds upon recent results showing notable improvements in semi-supervised training using cycle-consistency and related techniques. Such techniqu…

    Submitted 20 August, 2019; v1 submitted 30 April, 2019; originally announced May 2019.

    Comments: INTERSPEECH 2019

  30. arXiv:1904.03876  [pdf, other]

    cs.LG cs.SD eess.AS stat.ML

    Bayesian Subspace Hidden Markov Model for Acoustic Unit Discovery

    Authors: Lucas Ondel, Hari Krishna Vydana, Lukáš Burget, Jan Černocký

    Abstract: This work tackles the problem of learning a set of language-specific acoustic units from unlabeled speech recordings given a set of labeled recordings from other languages. Our approach may be described by the following two-step procedure: first, the model learns the notion of acoustic units from the labeled data, and then the model uses its knowledge to find new acoustic units on the target langu…

    Submitted 2 July, 2019; v1 submitted 8 April, 2019; originally announced April 2019.

    Comments: Accepted to Interspeech 2019 * corrected typos * Recalculated the segmentation using +-2 frames tolerance to comply with other publications

  31. arXiv:1811.07629  [pdf, other]

    eess.AS cs.SD

    Analysis of DNN Speech Signal Enhancement for Robust Speaker Recognition

    Authors: Ondrej Novotny, Oldrich Plchot, Ondrej Glembek, Jan "Honza" Cernocky, Lukas Burget

    Abstract: In this work, we present an analysis of a DNN-based autoencoder for speech enhancement, dereverberation and denoising. The target application is a robust speaker verification (SV) system. We start our approach by carefully designing a data augmentation process to cover a wide range of acoustic conditions and obtain rich training data for various components of our SV system. We augment several well-k…

    Submitted 19 November, 2018; originally announced November 2018.

    Comments: 16 pages, 7 figures, Submission to Computer Speech and Language, special issue on Speaker and language characterization and recognition

  32. Building and Evaluation of a Real Room Impulse Response Dataset

    Authors: Igor Szoke, Miroslav Skacel, Ladislav Mosner, Jakub Paliesek, Jan "Honza" Cernocky

    Abstract: This paper presents BUT ReverbDB - a dataset of real room impulse responses (RIR), background noises and retransmitted speech data. The retransmitted data includes LibriSpeech test-clean, 2000 HUB5 English evaluation and part of 2010 NIST Speaker Recognition Evaluation datasets. We provide a detailed description of RIR collection (hardware, software, post-processing) that can serve as a "cook-boo…

    Submitted 30 May, 2019; v1 submitted 16 November, 2018; originally announced November 2018.

    Comments: Submitted to Journal of Selected Topics in Signal Processing, November 2018

  33. arXiv:1811.03451  [pdf, other]

    eess.AS cs.CL cs.LG

    Analysis of Multilingual Sequence-to-Sequence speech recognition systems

    Authors: Martin Karafiát, Murali Karthick Baskar, Shinji Watanabe, Takaaki Hori, Matthew Wiesner, Jan "Honza" Černocký

    Abstract: This paper investigates the applications of various multilingual approaches developed in conventional hidden Markov model (HMM) systems to sequence-to-sequence (seq2seq) automatic speech recognition (ASR). On a set composed of Babel data, we first show the effectiveness of multi-lingual training with stacked bottle-neck (SBN) features. Then we explore various architectures and training strategies…

    Submitted 7 November, 2018; originally announced November 2018.

    Comments: arXiv admin note: text overlap with arXiv:1810.03459

  34. arXiv:1811.02770  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Promising Accurate Prefix Boosting for sequence-to-sequence ASR

    Authors: Murali Karthick Baskar, Lukáš Burget, Shinji Watanabe, Martin Karafiát, Takaaki Hori, Jan Honza Černocký

    Abstract: In this paper, we present promising accurate prefix boosting (PAPB), a discriminative training technique for attention based sequence-to-sequence (seq2seq) ASR. PAPB is devised to unify the training and testing scheme in an effective manner. The training procedure involves maximizing the score of each partial correct sequence obtained during beam search compared to other hypotheses. The training o…

    Submitted 7 November, 2018; originally announced November 2018.

  35. arXiv:1811.02066  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    How to Improve Your Speaker Embeddings Extractor in Generic Toolkits

    Authors: Hossein Zeinali, Lukas Burget, Johan Rohdin, Themos Stafylakis, Jan Cernocky

    Abstract: Recently, speaker embeddings extracted with deep neural networks became the state-of-the-art method for speaker verification. In this paper we aim to facilitate its implementation on a more generic toolkit than Kaldi, which we anticipate will enable further improvements on the method. We examine several tricks in training, such as the effects of normalizing input features and pooled statistics, diff…

    Submitted 5 November, 2018; originally announced November 2018.

  36. arXiv:1810.04273  [pdf, ps, other]

    eess.AS cs.SD

    Convolutional Neural Networks and x-vector Embedding for DCASE2018 Acoustic Scene Classification Challenge

    Authors: Hossein Zeinali, Lukas Burget, Jan Cernocky

    Abstract: In this paper, the Brno University of Technology (BUT) team submissions for Task 1 (Acoustic Scene Classification, ASC) of the DCASE-2018 challenge are described. Also, the analysis of different methods on the leaderboard set is provided. The proposed approach is a fusion of two different Convolutional Neural Network (CNN) topologies. The first one is the common two-dimensional CNNs which is mainl…

    Submitted 1 October, 2018; originally announced October 2018.

    Journal ref: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018)

  37. arXiv:1809.11068  [pdf, other]

    cs.SD cs.CL eess.AS

    Spoken Pass-Phrase Verification in the i-vector Space

    Authors: Hossein Zeinali, Lukas Burget, Hossein Sameti, Jan Cernocky

    Abstract: The task of spoken pass-phrase verification is to decide whether a test utterance contains the same phrase as given enrollment utterances. Besides other applications, pass-phrase verification can complement an independent speaker verification subsystem in text-dependent speaker verification. It can also be used for liveness detection by verifying that the user is able to correctly respond to a rand…

    Submitted 28 September, 2018; originally announced September 2018.

    Journal ref: Proc. Odyssey 2018 The Speaker and Language Recognition Workshop

  38. arXiv:1808.01916  [pdf, other]

    cs.CL cs.LG stat.ML

    Residual Memory Networks: Feed-forward approach to learn long temporal dependencies

    Authors: Murali Karthick Baskar, Martin Karafiat, Lukas Burget, Karel Vesely, Frantisek Grezl, Jan Honza Cernocky

    Abstract: Training deep recurrent neural network (RNN) architectures is complicated due to the increased network complexity. This disrupts the learning of higher order abstracts using deep RNN. In the case of feed-forward networks, training deep structures is simpler and faster, while learning long-term temporal information is not possible. In this paper we propose a residual memory neural network (RMN) architectu…

    Submitted 6 August, 2018; originally announced August 2018.