
Showing 1–21 of 21 results for author: Moro-Velázquez, L

  1. arXiv:2406.07461  [pdf, other]


    Noise-robust Speech Separation with Fast Generative Correction

    Authors: Helin Wang, Jesus Villalba, Laureano Moro-Velazquez, Jiarui Hai, Thomas Thebaud, Najim Dehak

    Abstract: Speech separation, the task of isolating multiple speech sources from a mixed audio signal, remains challenging in noisy environments. In this paper, we propose a generative correction method to enhance the output of a discriminative separator. By leveraging a generative corrector based on a diffusion model, we refine the separation process for single-channel mixture speech by removing noises and…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted at INTERSPEECH 2024

  2. arXiv:2311.10149  [pdf, other]


    Improving fairness for spoken language understanding in atypical speech with Text-to-Speech

    Authors: Helin Wang, Venkatesh Ravichandran, Milind Rao, Becky Lammers, Myra Sydnor, Nicholas Maragakis, Ankur A. Butala, Jayne Zhang, Lora Clawson, Victoria Chovaz, Laureano Moro-Velazquez

    Abstract: Spoken language understanding (SLU) systems often exhibit suboptimal performance in processing atypical speech, typically caused by neurological conditions and motor impairments. Recent advancements in Text-to-Speech (TTS) synthesis-based augmentation for more fair SLU have struggled to accurately capture the unique vocal characteristics of atypical speakers, largely due to insufficient data. To a…

    Submitted 16 November, 2023; originally announced November 2023.

    Comments: Accepted at SyntheticData4ML 2023 Oral

  3. arXiv:2309.04628  [pdf, other]

    eess.AS cs.SD

    Leveraging Pretrained Image-text Models for Improving Audio-Visual Learning

    Authors: Saurabhchand Bhati, Jesús Villalba, Laureano Moro-Velazquez, Thomas Thebaud, Najim Dehak

    Abstract: Visually grounded speech systems learn from paired images and their spoken captions. Recently, there have been attempts to utilize the visually grounded models trained from images and their corresponding text captions, such as CLIP, to improve speech-based visually grounded models' performance. However, the majority of these models only utilize the pretrained image encoder. Cascaded SpeechCLIP att…

    Submitted 8 September, 2023; originally announced September 2023.

  4. arXiv:2306.10588  [pdf, other]

    eess.AS eess.SP

    DuTa-VC: A Duration-aware Typical-to-atypical Voice Conversion Approach with Diffusion Probabilistic Model

    Authors: Helin Wang, Thomas Thebaud, Jesus Villalba, Myra Sydnor, Becky Lammers, Najim Dehak, Laureano Moro-Velazquez

    Abstract: We present a novel typical-to-atypical voice conversion approach (DuTa-VC), which (i) can be trained with nonparallel data (ii) first introduces diffusion probabilistic model (iii) preserves the target speaker identity (iv) is aware of the phoneme duration of the target speaker. DuTa-VC consists of three parts: an encoder transforms the source mel-spectrogram into a duration-modified speaker-indep…

    Submitted 18 June, 2023; originally announced June 2023.

  5. arXiv:2304.05974  [pdf, other]


    Regularizing Contrastive Predictive Coding for Speech Applications

    Authors: Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velazquez, Najim Dehak

    Abstract: Self-supervised methods such as Contrastive predictive Coding (CPC) have greatly improved the quality of the unsupervised representations. These representations significantly reduce the amount of labeled data needed for downstream task performance, such as automatic speech recognition. CPC learns representations by learning to predict future frames given current frames. Based on the observation th…

    Submitted 26 April, 2023; v1 submitted 12 April, 2023; originally announced April 2023.

  6. arXiv:2303.03657  [pdf, other]


    Self-FiLM: Conditioning GANs with self-supervised representations for bandwidth extension based speaker recognition

    Authors: Saurabh Kataria, Jesús Villalba, Laureano Moro-Velázquez, Thomas Thebaud, Najim Dehak

    Abstract: Speech super-resolution/Bandwidth Extension (BWE) can improve downstream tasks like Automatic Speaker Verification (ASV). We introduce a simple novel technique called Self-FiLM to inject self-supervision into existing BWE models via Feature-wise Linear Modulation. We hypothesize that such information captures domain/environment information, which can give zero-shot generalization. Self-FiLM Condit…

    Submitted 7 March, 2023; originally announced March 2023.

    Comments: Under review

  7. arXiv:2209.01702  [pdf, other]


    Time-domain speech super-resolution with GAN based modeling for telephony speaker verification

    Authors: Saurabh Kataria, Jesús Villalba, Laureano Moro-Velázquez, Piotr Żelasko, Najim Dehak

    Abstract: Automatic Speaker Verification (ASV) technology has become commonplace in virtual assistants. However, its performance suffers when there is a mismatch between the train and test domains. Mixed bandwidth training, i.e., pooling training data from both domains, is a preferred choice for developing a universal model that works for both narrowband and wideband domains. We propose complementing this t…

    Submitted 4 September, 2022; originally announced September 2022.

    Comments: Submit to IEEE/ACM Transactions on Audio, Speech, and Language Processing

  8. arXiv:2208.05445  [pdf, other]

    eess.AS cs.AI cs.LG

    Non-Contrastive Self-supervised Learning for Utterance-Level Information Extraction from Speech

    Authors: Jaejin Cho, Jesús Villalba, Laureano Moro-Velazquez, Najim Dehak

    Abstract: In recent studies, self-supervised pre-trained models tend to outperform supervised pre-trained models in transfer learning. In particular, self-supervised learning (SSL) of utterance-level speech representation can be used in speech applications that require discriminative representation of consistent attributes within an utterance: speaker, language, emotion, and age. Existing frame-level self-s…

    Submitted 10 August, 2022; originally announced August 2022.

    Comments: EARLY ACCESS of IEEE JSTSP Special Issue on Self-Supervised Learning for Speech and Audio Processing

  9. arXiv:2208.05413  [pdf, other]

    eess.AS cs.LG

    Non-Contrastive Self-Supervised Learning of Utterance-Level Speech Representations

    Authors: Jaejin Cho, Raghavendra Pappagari, Piotr Żelasko, Laureano Moro-Velazquez, Jesús Villalba, Najim Dehak

    Abstract: Considering the abundance of unlabeled speech data and the high labeling costs, unsupervised learning methods can be essential for better system development. One of the most successful methods is contrastive self-supervised methods, which require negative sampling: sampling alternative samples to contrast with the current sample (anchor). However, it is hard to ensure if all the negative samples b…

    Submitted 10 August, 2022; originally announced August 2022.

    Comments: Accepted at Interspeech 2022

  10. arXiv:2203.16614  [pdf, other]

    eess.AS cs.SD

    Joint domain adaptation and speech bandwidth extension using time-domain GANs for speaker verification

    Authors: Saurabh Kataria, Jesús Villalba, Laureano Moro-Velázquez, Najim Dehak

    Abstract: Speech systems developed for a particular choice of acoustic domain and sampling frequency do not translate easily to others. The usual practice is to learn domain adaptation and bandwidth extension models independently. Contrary to this, we propose to learn both tasks together. Particularly, we learn to map narrowband conversational telephone speech to wideband microphone speech. We developed par…

    Submitted 30 March, 2022; originally announced March 2022.

    Comments: submitted to Interspeech 2022

  11. arXiv:2110.02345  [pdf, other]

    eess.AS cs.SD

    Unsupervised Speech Segmentation and Variable Rate Representation Learning using Segmental Contrastive Predictive Coding

    Authors: Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velazquez, Najim Dehak

    Abstract: Typically, unsupervised segmentation of speech into the phone and word-like units are treated as separate tasks and are often done via different methods which do not fully leverage the inter-dependence of the two tasks. Here, we unify them and propose a technique that can jointly perform both, showing that these two tasks indeed benefit from each other. Recent attempts employ self-supervised learn…

    Submitted 8 October, 2021; v1 submitted 5 October, 2021; originally announced October 2021.

    Comments: arXiv admin note: substantial text overlap with arXiv:2106.02170

  12. arXiv:2109.06112  [pdf, other]

    cs.CL cs.SD eess.AS

    Beyond Isolated Utterances: Conversational Emotion Recognition

    Authors: Raghavendra Pappagari, Piotr Żelasko, Jesús Villalba, Laureano Moro-Velazquez, Najim Dehak

    Abstract: Speech emotion recognition is the task of recognizing the speaker's emotional state given a recording of their utterance. While most of the current approaches focus on inferring emotion from isolated utterances, we argue that this is not sufficient to achieve conversational emotion recognition (CER) which deals with recognizing emotions in conversations. In this work, we propose several approaches…

    Submitted 13 September, 2021; originally announced September 2021.

    Comments: Accepted for ASRU 2021

  13. arXiv:2106.08427  [pdf, other]

    cs.SD cs.CL eess.AS

    Pathological voice adaptation with autoencoder-based voice conversion

    Authors: Marc Illa, Bence Mark Halpern, Rob van Son, Laureano Moro-Velazquez, Odette Scharenborg

    Abstract: In this paper, we propose a new approach to pathological speech synthesis. Instead of using healthy speech as a source, we customise an existing pathological speech sample to a new speaker's voice characteristics. This approach alleviates the evaluation problem one normally has when converting typical speech to pathological speech, as in our approach, the voice conversion (VC) model does not need…

    Submitted 15 June, 2021; originally announced June 2021.

    Comments: 6 pages, 3 figures. Accepted to the 11th ISCA Speech Synthesis Workshop (2021)

  14. arXiv:2106.02170  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation

    Authors: Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velazquez, Najim Dehak

    Abstract: Automatic detection of phoneme or word-like units is one of the core objectives in zero-resource speech processing. Recent attempts employ self-supervised training methods, such as contrastive predictive coding (CPC), where the next frame is predicted given past context. However, CPC only looks at the audio signal's frame-level structure. We overcome this limitation with a segmental contrastive pr…

    Submitted 3 June, 2021; originally announced June 2021.

  15. arXiv:2104.01433  [pdf, other]


    Deep Feature CycleGANs: Speaker Identity Preserving Non-parallel Microphone-Telephone Domain Adaptation for Speaker Verification

    Authors: Saurabh Kataria, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velázquez, Najim Dehak

    Abstract: With the increase in the availability of speech from varied domains, it is imperative to use such out-of-domain data to improve existing speech systems. Domain adaptation is a prominent pre-processing approach for this. We investigate it for adapting microphone speech to the telephone domain. Specifically, we explore CycleGAN-based unpaired translation of microphone data to improve the x-vector/speak…

    Submitted 3 April, 2021; originally announced April 2021.

  16. arXiv:2104.00994  [pdf, other]

    eess.AS cs.CL cs.SD

    Unsupervised Acoustic Unit Discovery by Leveraging a Language-Independent Subword Discriminative Feature Representation

    Authors: Siyuan Feng, Piotr Żelasko, Laureano Moro-Velázquez, Odette Scharenborg

    Abstract: This paper tackles automatically discovering phone-like acoustic units (AUD) from unlabeled speech data. Past studies usually proposed single-step approaches. We propose a two-stage approach: the first stage learns a subword-discriminative feature representation and the second stage applies clustering to the learned representation and obtains phone-like clusters as the discovered acoustic units. I…

    Submitted 7 June, 2021; v1 submitted 2 April, 2021; originally announced April 2021.

    Comments: Accepted for publication in INTERSPEECH 2021

  17. arXiv:2101.08909  [pdf, other]

    eess.AS cs.SD

    Study of Pre-processing Defenses against Adversarial Attacks on State-of-the-art Speaker Recognition Systems

    Authors: Sonal Joshi, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velázquez, Najim Dehak

    Abstract: Adversarial examples to speaker recognition (SR) systems are generated by adding a carefully crafted noise to the speech signal to make the system fail while being imperceptible to humans. Such attacks pose severe security risks, making it vital to deep-dive and understand how much the state-of-the-art SR systems are vulnerable to these attacks. Moreover, it is of greater importance to propose def…

    Submitted 25 June, 2021; v1 submitted 21 January, 2021; originally announced January 2021.

    Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  18. arXiv:2011.14259  [pdf, other]

    eess.IV cs.CV

    Artificial Intelligence applied to chest X-Ray images for the automatic detection of COVID-19. A thoughtful evaluation approach

    Authors: Julian D. Arias-Londoño, Jorge A. Gomez-Garcia, Laureano Moro-Velazquez, Juan I. Godino-Llorente

    Abstract: Current standard protocols used in the clinic for diagnosing COVID-19 include molecular or antigen tests, generally complemented by a plain chest X-Ray. The combined analysis aims to reduce the significant number of false negatives of these tests, but also to provide complementary evidence about the presence and severity of the disease. However, the procedure is not free of errors, and the interpr…

    Submitted 28 November, 2020; originally announced November 2020.

  19. arXiv:2010.14602  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    CopyPaste: An Augmentation Method for Speech Emotion Recognition

    Authors: Raghavendra Pappagari, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velazquez, Najim Dehak

    Abstract: Data augmentation is a widely used strategy for training robust machine learning models. It partially alleviates the problem of limited data for tasks like speech emotion recognition (SER), where collecting data is expensive and challenging. This study proposes CopyPaste, a perceptually motivated novel augmentation procedure for SER. Assuming that the presence of emotions other than neutral dictat…

    Submitted 11 February, 2021; v1 submitted 27 October, 2020; originally announced October 2020.

    Comments: Accepted at ICASSP2021

  20. How Phonotactics Affect Multilingual and Zero-shot ASR Performance

    Authors: Siyuan Feng, Piotr Żelasko, Laureano Moro-Velázquez, Ali Abavisani, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak

    Abstract: The idea of combining multiple languages' recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successfu…

    Submitted 10 February, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

    Comments: Accepted for publication in IEEE ICASSP 2021. The first 2 authors contributed equally to this work

  21. arXiv:2005.08118  [pdf, other]

    eess.AS cs.CL cs.SD

    That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages

    Authors: Piotr Żelasko, Laureano Moro-Velázquez, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak

    Abstract: Only a handful of the world's languages are abundant with the resources that enable practical applications of speech processing technologies. One of the methods to overcome this problem is to use the resources existing in other languages to train a multilingual automatic speech recognition (ASR) model, which, intuitively, should learn some universal phonetic representations. In this work, we focus…

    Submitted 16 May, 2020; originally announced May 2020.

    Comments: Submitted to Interspeech 2020. For some reason, the ArXiv Latex engine rendered it in more than 4 pages