Skip to main content

Showing 1–16 of 16 results for author: Moro-Velázquez, L

Searching in archive cs. Search in all archives.
.
  1. arXiv:2311.06170  [pdf, other

    cs.LG

    Time Scale Network: A Shallow Neural Network For Time Series Data

    Authors: Trevor Meyer, Camden Shultz, Najim Dehak, Laureano Moro-Velazquez, Pedro Irazoqui

    Abstract: Time series data is often composed of information at multiple time scales, particularly in biomedical data. While numerous deep learning strategies exist to capture this information, many make networks larger, require more data, are more demanding to compute, and are difficult to interpret. This limits their usefulness in real-world applications facing even modest computational or data constraints… ▽ More

    Submitted 10 November, 2023; originally announced November 2023.

    Comments: 8 pages, 5 figures, preprint

  2. arXiv:2309.04628  [pdf, other

    eess.AS cs.SD

    Leveraging Pretrained Image-text Models for Improving Audio-Visual Learning

    Authors: Saurabhchand Bhati, Jesús Villalba, Laureano Moro-Velazquez, Thomas Thebaud, Najim Dehak

    Abstract: Visually grounded speech systems learn from paired images and their spoken captions. Recently, there have been attempts to utilize the visually grounded models trained from images and their corresponding text captions, such as CLIP, to improve speech-based visually grounded models' performance. However, the majority of these models only utilize the pretrained image encoder. Cascaded SpeechCLIP att… ▽ More

    Submitted 8 September, 2023; originally announced September 2023.

  3. arXiv:2303.04187  [pdf, other

    cs.LG

    Stabilized training of joint energy-based models and their practical applications

    Authors: Martin Sustek, Samik Sadhu, Lukas Burget, Hynek Hermansky, Jesus Villalba, Laureano Moro-Velazquez, Najim Dehak

    Abstract: The recently proposed Joint Energy-based Model (JEM) interprets discriminatively trained classifier $p(y|x)$ as an energy model, which is also trained as a generative model describing the distribution of the input observations $p(x)$. The JEM training relies on "positive examples" (i.e. examples from the training data set) as well as on "negative examples", which are samples from the modeled distr… ▽ More

    Submitted 7 March, 2023; originally announced March 2023.

  4. arXiv:2208.05445  [pdf, other

    eess.AS cs.AI cs.LG

    Non-Contrastive Self-supervised Learning for Utterance-Level Information Extraction from Speech

    Authors: Jae** Cho, Jes'us Villalba, Laureano Moro-Velazquez, Najim Dehak

    Abstract: In recent studies, self-supervised pre-trained models tend to outperform supervised pre-trained models in transfer learning. In particular, self-supervised learning (SSL) of utterance-level speech representation can be used in speech applications that require discriminative representation of consistent attributes within an utterance: speaker, language, emotion, and age. Existing frame-level self-s… ▽ More

    Submitted 10 August, 2022; originally announced August 2022.

    Comments: EARLY ACCESS of IEEE JSTSP Special Issue on Self-Supervised Learning for Speech and Audio Processing

  5. arXiv:2208.05413  [pdf, other

    eess.AS cs.LG

    Non-Contrastive Self-Supervised Learning of Utterance-Level Speech Representations

    Authors: Jae** Cho, Raghavendra Pappagari, Piotr Żelasko, Laureano Moro-Velazquez, Jesús Villalba, Najim Dehak

    Abstract: Considering the abundance of unlabeled speech data and the high labeling costs, unsupervised learning methods can be essential for better system development. One of the most successful methods is contrastive self-supervised methods, which require negative sampling: sampling alternative samples to contrast with the current sample (anchor). However, it is hard to ensure if all the negative samples b… ▽ More

    Submitted 10 August, 2022; originally announced August 2022.

    Comments: Accepted at Interspeech 2022

  6. arXiv:2203.16614  [pdf, other

    eess.AS cs.SD

    Joint domain adaptation and speech bandwidth extension using time-domain GANs for speaker verification

    Authors: Saurabh Kataria, Jesús Villalba, Laureano Moro-Velázquez, Najim Dehak

    Abstract: Speech systems developed for a particular choice of acoustic domain and sampling frequency do not translate easily to others. The usual practice is to learn domain adaptation and bandwidth extension models independently. Contrary to this, we propose to learn both tasks together. Particularly, we learn to map narrowband conversational telephone speech to wideband microphone speech. We developed par… ▽ More

    Submitted 30 March, 2022; originally announced March 2022.

    Comments: submitted to Interspeech 2022

  7. arXiv:2110.02345  [pdf, other

    eess.AS cs.SD

    Unsupervised Speech Segmentation and Variable Rate Representation Learning using Segmental Contrastive Predictive Coding

    Authors: Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velazquez, Najim Dehak

    Abstract: Typically, unsupervised segmentation of speech into the phone and word-like units are treated as separate tasks and are often done via different methods which do not fully leverage the inter-dependence of the two tasks. Here, we unify them and propose a technique that can jointly perform both, showing that these two tasks indeed benefit from each other. Recent attempts employ self-supervised learn… ▽ More

    Submitted 8 October, 2021; v1 submitted 5 October, 2021; originally announced October 2021.

    Comments: arXiv admin note: substantial text overlap with arXiv:2106.02170

  8. arXiv:2109.06112  [pdf, other

    cs.CL cs.SD eess.AS

    Beyond Isolated Utterances: Conversational Emotion Recognition

    Authors: Raghavendra Pappagari, Piotr Żelasko, Jesús Villalba, Laureano Moro-Velazquez, Najim Dehak

    Abstract: Speech emotion recognition is the task of recognizing the speaker's emotional state given a recording of their utterance. While most of the current approaches focus on inferring emotion from isolated utterances, we argue that this is not sufficient to achieve conversational emotion recognition (CER) which deals with recognizing emotions in conversations. In this work, we propose several approaches… ▽ More

    Submitted 13 September, 2021; originally announced September 2021.

    Comments: Accepted for ASRU 2021

  9. arXiv:2106.08427  [pdf, other

    cs.SD cs.CL eess.AS

    Pathological voice adaptation with autoencoder-based voice conversion

    Authors: Marc Illa, Bence Mark Halpern, Rob van Son, Laureano Moro-Velazquez, Odette Scharenborg

    Abstract: In this paper, we propose a new approach to pathological speech synthesis. Instead of using healthy speech as a source, we customise an existing pathological speech sample to a new speaker's voice characteristics. This approach alleviates the evaluation problem one normally has when converting typical speech to pathological speech, as in our approach, the voice conversion (VC) model does not need… ▽ More

    Submitted 15 June, 2021; originally announced June 2021.

    Comments: 6 pages, 3 figures. Accepted to the 11th ISCA Speech Synthesis Workshop (2021)

  10. arXiv:2106.02170  [pdf, other

    eess.AS cs.CL cs.LG cs.SD

    Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation

    Authors: Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velazquez, Najim Dehak

    Abstract: Automatic detection of phoneme or word-like units is one of the core objectives in zero-resource speech processing. Recent attempts employ self-supervised training methods, such as contrastive predictive coding (CPC), where the next frame is predicted given past context. However, CPC only looks at the audio signal's frame-level structure. We overcome this limitation with a segmental contrastive pr… ▽ More

    Submitted 3 June, 2021; originally announced June 2021.

  11. arXiv:2104.00994  [pdf, other

    eess.AS cs.CL cs.SD

    Unsupervised Acoustic Unit Discovery by Leveraging a Language-Independent Subword Discriminative Feature Representation

    Authors: Siyuan Feng, Piotr Żelasko, Laureano Moro-Velázquez, Odette Scharenborg

    Abstract: This paper tackles automatically discovering phone-like acoustic units (AUD) from unlabeled speech data. Past studies usually proposed single-step approaches. We propose a two-stage approach: the first stage learns a subword-discriminative feature representation and the second stage applies clustering to the learned representation and obtains phone-like clusters as the discovered acoustic units. I… ▽ More

    Submitted 7 June, 2021; v1 submitted 2 April, 2021; originally announced April 2021.

    Comments: Accepted for publication in INTERSPEECH 2021

  12. arXiv:2101.08909  [pdf, other

    eess.AS cs.SD

    Study of Pre-processing Defenses against Adversarial Attacks on State-of-the-art Speaker Recognition Systems

    Authors: Sonal Joshi, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velázquez, Najim Dehak

    Abstract: Adversarial examples to speaker recognition (SR) systems are generated by adding a carefully crafted noise to the speech signal to make the system fail while being imperceptible to humans. Such attacks pose severe security risks, making it vital to deep-dive and understand how much the state-of-the-art SR systems are vulnerable to these attacks. Moreover, it is of greater importance to propose def… ▽ More

    Submitted 25 June, 2021; v1 submitted 21 January, 2021; originally announced January 2021.

    Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  13. arXiv:2011.14259  [pdf, other

    eess.IV cs.CV

    Artificial Intelligence applied to chest X-Ray images for the automatic detection of COVID-19. A thoughtful evaluation approach

    Authors: Julian D. Arias-Londoño, Jorge A. Gomez-Garcia, Laureano Moro-Velazquez, Juan I. Godino-Llorente

    Abstract: Current standard protocols used in the clinic for diagnosing COVID-19 include molecular or antigen tests, generally complemented by a plain chest X-Ray. The combined analysis aims to reduce the significant number of false negatives of these tests, but also to provide complementary evidence about the presence and severity of the disease. However, the procedure is not free of errors, and the interpr… ▽ More

    Submitted 28 November, 2020; originally announced November 2020.

  14. arXiv:2010.14602  [pdf, ps, other

    cs.SD cs.LG eess.AS

    CopyPaste: An Augmentation Method for Speech Emotion Recognition

    Authors: Raghavendra Pappagari, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velazquez, Najim Dehak

    Abstract: Data augmentation is a widely used strategy for training robust machine learning models. It partially alleviates the problem of limited data for tasks like speech emotion recognition (SER), where collecting data is expensive and challenging. This study proposes CopyPaste, a perceptually motivated novel augmentation procedure for SER. Assuming that the presence of emotions other than neutral dictat… ▽ More

    Submitted 11 February, 2021; v1 submitted 27 October, 2020; originally announced October 2020.

    Comments: Accepted at ICASSP2021

  15. How Phonotactics Affect Multilingual and Zero-shot ASR Performance

    Authors: Siyuan Feng, Piotr Żelasko, Laureano Moro-Velázquez, Ali Abavisani, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak

    Abstract: The idea of combining multiple languages' recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successfu… ▽ More

    Submitted 10 February, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

    Comments: Accepted for publication in IEEE ICASSP 2021. The first 2 authors contributed equally to this work

  16. arXiv:2005.08118  [pdf, other

    eess.AS cs.CL cs.SD

    That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages

    Authors: Piotr Żelasko, Laureano Moro-Velázquez, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak

    Abstract: Only a handful of the world's languages are abundant with the resources that enable practical applications of speech processing technologies. One of the methods to overcome this problem is to use the resources existing in other languages to train a multilingual automatic speech recognition (ASR) model, which, intuitively, should learn some universal phonetic representations. In this work, we focus… ▽ More

    Submitted 16 May, 2020; originally announced May 2020.

    Comments: Submitted to Interspeech 2020. For some reason, the ArXiv Latex engine rendered it in more than 4 pages