Showing 1–50 of 56 results for author: Villalba, J

  1. arXiv:2406.07461  [pdf, other]

    eess.AS

    Noise-robust Speech Separation with Fast Generative Correction

    Authors: Helin Wang, Jesus Villalba, Laureano Moro-Velazquez, Jiarui Hai, Thomas Thebaud, Najim Dehak

    Abstract: Speech separation, the task of isolating multiple speech sources from a mixed audio signal, remains challenging in noisy environments. In this paper, we propose a generative correction method to enhance the output of a discriminative separator. By leveraging a generative corrector based on a diffusion model, we refine the separation process for single-channel mixture speech by removing noises and…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted at INTERSPEECH 2024

  2. arXiv:2404.10930  [pdf, ps, other]

    math.OC

    A preconditioner for solving linear programming problems with dense columns

    Authors: Catalina J. Villalba, Aurelio R. L. Oliveira

    Abstract: The Interior-Point Methods are a class of methods for solving linear programming problems that rely upon the solution of linear systems. At each iteration, it becomes important to determine how to solve these linear systems when the constraint matrix of the linear programming problem includes dense columns. In this paper, we propose a preconditioner to handle linear programming problems with dense columns, a…

    Submitted 16 April, 2024; originally announced April 2024.

  3. arXiv:2403.07891  [pdf]

    cs.CV cs.CR cs.LG

    Digital Video Manipulation Detection Technique Based on Compression Algorithms

    Authors: Edgar Gonzalez Fernandez, Ana Lucila Sandoval Orozco, Luis Javier Garcia Villalba

    Abstract: Digital images and videos play a very important role in everyday life. Nowadays, people have access to affordable mobile devices equipped with advanced integrated cameras and powerful image processing applications. Technological development facilitates not only the generation of multimedia content, but also the intentional modification of it, either with recreational or malicious purposes. This i…

    Submitted 3 February, 2024; originally announced March 2024.

    Journal ref: IEEE Transactions on Intelligent Transportation Systems, Vol. 23, No. 3, pp. 2596-2605, December 2021

  4. arXiv:2402.19355  [pdf, other]

    cs.SD cs.CR cs.LG eess.AS

    Unraveling Adversarial Examples against Speaker Identification -- Techniques for Attack Detection and Victim Model Classification

    Authors: Sonal Joshi, Thomas Thebaud, Jesús Villalba, Najim Dehak

    Abstract: Adversarial examples have proven to threaten speaker identification systems, and several countermeasures against them have been proposed. In this paper, we propose a method to detect the presence of adversarial examples, i.e., a binary classifier distinguishing between benign and adversarial examples. We build upon and extend previous work on attack type classification by exploring new architectur…

    Submitted 29 February, 2024; originally announced February 2024.

  5. Adaptive Artificial Immune Networks for Mitigating DoS flooding Attacks

    Authors: Jorge Maestre Vidal, Ana Lucila Sandoval Orozco, Luis Javier García Villalba

    Abstract: Denial of service attacks pose a threat in constant growth. This is mainly due to their tendency to gain in sophistication, ease of implementation, obfuscation and the recent improvements in occultation of fingerprints. On the other hand, progress towards self-organizing networks, and the different techniques involved in their development, such as software-defined networking, network-function virt…

    Submitted 12 February, 2024; originally announced February 2024.

    Journal ref: J. Maestre Vidal, A. L. Sandoval Orozco, L. J. García Villalba: Adaptive Artificial Immune Networks for Mitigating DoS Flooding Attacks. Swarm and Evolutionary Computation. Vol. 38, pp. 94-108, February 2018

  6. Compression effects and scene details on the source camera identification of digital videos

    Authors: Raquel Ramos López, Ana Lucila Sandoval Orozco, Luis Javier García Villalba

    Abstract: The continuous growth of technologies like 4G or 5G has led to a massive use of mobile devices such as smartphones and tablets. This phenomenon, combined with the fact that people use mobile phones for a longer period of time, results in mobile phones becoming the main source of creation of visual information. However, its reliability as a true representation of reality cannot be taken for granted…

    Submitted 7 February, 2024; originally announced February 2024.

    Journal ref: Expert Systems with Applications, Vol. 170, pp. 114515, May 2021

  7. arXiv:2402.06661  [pdf]

    cs.CR cs.LG cs.MM eess.IV

    Authentication and integrity of smartphone videos through multimedia container structure analysis

    Authors: Carlos Quinto Huamán, Ana Lucila Sandoval Orozco, Luis Javier García Villalba

    Abstract: Nowadays, mobile devices have become the natural substitute for the digital camera, as they capture everyday situations easily and quickly, encouraging users to express themselves through images and videos. These videos can be shared across different platforms exposing them to any kind of intentional manipulation by criminals who are aware of the weaknesses of forensic techniques to accuse an inno…

    Submitted 5 February, 2024; originally announced February 2024.

    Journal ref: C. Quinto Huamán, A. L. Sandoval Orozco, L. J. García Villalba: Authentication and Integrity of Smartphone Videos Through Multimedia Container Structure Analysis. Future Generation Computer Systems. Vol. 108, pp. 15-33, July 2020

  8. A novel pattern recognition system for detecting Android malware by analyzing suspicious boot sequences

    Authors: Jorge Maestre Vidal, Marco Antonio Sotelo Monge, Luis Javier García Villalba

    Abstract: This paper introduces a malware detection system for smartphones based on studying the dynamic behavior of suspicious applications. The main goal is to prevent the installation of the malicious software on the victim systems. The approach focuses on identifying malware addressed against the Android platform. For that purpose, only the system calls performed during the boot process of the recently…

    Submitted 5 February, 2024; originally announced February 2024.

    Journal ref: Knowledge-Based Systems. Vol. 150, pp. 198-217, June 2018

  9. A security framework for Ethereum smart contracts

    Authors: Antonio López Vivar, Ana Lucila Sandoval Orozco, Luis Javier García Villalba

    Abstract: The use of blockchain and smart contracts has not stopped growing in recent years. Like all software that begins to expand its use, it is also beginning to be targeted by hackers who will try to exploit vulnerabilities in both the underlying technology and the smart contract code itself. While many tools already exist for analyzing vulnerabilities in smart contracts, the heterogeneity and variety…

    Submitted 5 February, 2024; originally announced February 2024.

    Journal ref: Computer Communications. Vol. 172, pp. 119-129, April 2021

  10. arXiv:2402.02240  [pdf]

    cs.CR stat.CO

    Recommendations on Statistical Randomness Test Batteries for Cryptographic Purposes

    Authors: Elena Almaraz Luengo, Luis Javier García Villalba

    Abstract: Security in different applications is closely related to the goodness of the sequences generated for such purposes. Not only in Cryptography but also in other areas, it is necessary to obtain long sequences of random numbers or that, at least, behave as such. To decide whether the generator used produces sequences that are random, unpredictable and independent, statistical checks are needed. Diffe…

    Submitted 3 February, 2024; originally announced February 2024.

    Journal ref: ACM Computing Surveys, Vol. 54, No. 80, pp. 12420, May 2021

  11. arXiv:2401.09464  [pdf]

    cs.AR

    Floating Point HUB Adder for RISC-V Sargantana Processor

    Authors: Gerardo Bandera, Javier Salamero, Miquel Moreto, Julio Villalba

    Abstract: HUB format is an emerging technique to improve the hardware and time requirement when round to nearest is needed. On the other hand, RISC-V is an open-source ISA that many companies currently use in their designs. This paper presents a tailored floating point HUB adder implemented in the Sargantana RISC-V processor.

    Submitted 8 January, 2024; originally announced January 2024.

    Comments: RISC-V Summit Europe, Barcelona, 5-9th June 2023

  12. arXiv:2401.00923  [pdf, other]

    astro-ph.CO

    Stacking the spectra of eROSITA galaxy cluster data for searches of the 3.5keV line: Dark matter decay or charge exchange?

    Authors: Justo Antonio Gonzalez Villalba

    Abstract: In this Master Thesis, we use a technique to shift and stack the X-Ray spectra of 1138 galaxy clusters from the eRASS-1 survey, totalling 430649 counts. In comparison with previous stacking techniques, the method presented here introduces proper normalization of the shifted redistribution matrix file (RMF), which allows recovery of the physical temperature and metallicity of the stacked spectra. Us…

    Submitted 15 February, 2024; v1 submitted 1 January, 2024; originally announced January 2024.

  13. arXiv:2309.04628  [pdf, other]

    eess.AS cs.SD

    Leveraging Pretrained Image-text Models for Improving Audio-Visual Learning

    Authors: Saurabhchand Bhati, Jesús Villalba, Laureano Moro-Velazquez, Thomas Thebaud, Najim Dehak

    Abstract: Visually grounded speech systems learn from paired images and their spoken captions. Recently, there have been attempts to utilize the visually grounded models trained from images and their corresponding text captions, such as CLIP, to improve speech-based visually grounded models' performance. However, the majority of these models only utilize the pretrained image encoder. Cascaded SpeechCLIP att…

    Submitted 8 September, 2023; originally announced September 2023.

  14. arXiv:2306.10588  [pdf, other]

    eess.AS eess.SP

    DuTa-VC: A Duration-aware Typical-to-atypical Voice Conversion Approach with Diffusion Probabilistic Model

    Authors: Helin Wang, Thomas Thebaud, Jesus Villalba, Myra Sydnor, Becky Lammers, Najim Dehak, Laureano Moro-Velazquez

    Abstract: We present a novel typical-to-atypical voice conversion approach (DuTa-VC), which (i) can be trained with nonparallel data (ii) first introduces diffusion probabilistic model (iii) preserves the target speaker identity (iv) is aware of the phoneme duration of the target speaker. DuTa-VC consists of three parts: an encoder transforms the source mel-spectrogram into a duration-modified speaker-indep…

    Submitted 18 June, 2023; originally announced June 2023.

  15. arXiv:2304.05974  [pdf, other]

    eess.AS

    Regularizing Contrastive Predictive Coding for Speech Applications

    Authors: Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velazquez, Najim Dehak

    Abstract: Self-supervised methods such as Contrastive predictive Coding (CPC) have greatly improved the quality of the unsupervised representations. These representations significantly reduce the amount of labeled data needed for downstream task performance, such as automatic speech recognition. CPC learns representations by learning to predict future frames given current frames. Based on the observation th…

    Submitted 26 April, 2023; v1 submitted 12 April, 2023; originally announced April 2023.

  16. arXiv:2303.04187  [pdf, other]

    cs.LG

    Stabilized training of joint energy-based models and their practical applications

    Authors: Martin Sustek, Samik Sadhu, Lukas Burget, Hynek Hermansky, Jesus Villalba, Laureano Moro-Velazquez, Najim Dehak

    Abstract: The recently proposed Joint Energy-based Model (JEM) interprets discriminatively trained classifier $p(y|x)$ as an energy model, which is also trained as a generative model describing the distribution of the input observations $p(x)$. The JEM training relies on "positive examples" (i.e. examples from the training data set) as well as on "negative examples", which are samples from the modeled distr…

    Submitted 7 March, 2023; originally announced March 2023.

  17. arXiv:2303.03657  [pdf, other]

    eess.AS

    Self-FiLM: Conditioning GANs with self-supervised representations for bandwidth extension based speaker recognition

    Authors: Saurabh Kataria, Jesús Villalba, Laureano Moro-Velázquez, Thomas Thebaud, Najim Dehak

    Abstract: Speech super-resolution/Bandwidth Extension (BWE) can improve downstream tasks like Automatic Speaker Verification (ASV). We introduce a simple novel technique called Self-FiLM to inject self-supervision into existing BWE models via Feature-wise Linear Modulation. We hypothesize that such information captures domain/environment information, which can give zero-shot generalization. Self-FiLM Condit…

    Submitted 7 March, 2023; originally announced March 2023.

    Comments: Under review

  18. arXiv:2210.02276  [pdf, other]

    astro-ph.IM astro-ph.GA astro-ph.HE astro-ph.SR

    CASA, the Common Astronomy Software Applications for Radio Astronomy

    Authors: THE CASA TEAM, Ben Bean, Sanjay Bhatnagar, Sandra Castro, Jennifer Donovan Meyer, Bjorn Emonts, Enrique Garcia, Robert Garwood, Kumar Golap, Justo Gonzalez Villalba, Pamela Harris, Yohei Hayashi, Josh Hoskins, Mingyu Hsieh, Preshanth Jagannathan, Wataru Kawasaki, Aard Keimpema, Mark Kettenis, Jorge Lopez, Joshua Marvil, Joseph Masters, Andrew McNichols, David Mehringer, Renaud Miel, George Moellenbrock , et al. (24 additional authors not shown)

    Abstract: CASA, the Common Astronomy Software Applications, is the primary data processing software for the Atacama Large Millimeter/submillimeter Array (ALMA) and the Karl G. Jansky Very Large Array (VLA), and is frequently used also for other radio telescopes. The CASA software can handle data from single-dish, aperture-synthesis, and Very Long Baseline Interferometry (VLBI) telescopes. One of its core f…

    Submitted 5 October, 2022; originally announced October 2022.

    Comments: Accepted for publication in PASP (20 pages, 4 figures). Joint publication with CASA-VLBI paper

  19. arXiv:2209.01702  [pdf, other]

    eess.AS

    Time-domain speech super-resolution with GAN based modeling for telephony speaker verification

    Authors: Saurabh Kataria, Jesús Villalba, Laureano Moro-Velázquez, Piotr Żelasko, Najim Dehak

    Abstract: Automatic Speaker Verification (ASV) technology has become commonplace in virtual assistants. However, its performance suffers when there is a mismatch between the train and test domains. Mixed bandwidth training, i.e., pooling training data from both domains, is a preferred choice for developing a universal model that works for both narrowband and wideband domains. We propose complementing this t…

    Submitted 4 September, 2022; originally announced September 2022.

    Comments: Submit to IEEE/ACM Transactions on Audio, Speech, and Language Processing

  20. arXiv:2208.05445  [pdf, other]

    eess.AS cs.AI cs.LG

    Non-Contrastive Self-supervised Learning for Utterance-Level Information Extraction from Speech

    Authors: Jaejin Cho, Jesús Villalba, Laureano Moro-Velazquez, Najim Dehak

    Abstract: In recent studies, self-supervised pre-trained models tend to outperform supervised pre-trained models in transfer learning. In particular, self-supervised learning (SSL) of utterance-level speech representation can be used in speech applications that require discriminative representation of consistent attributes within an utterance: speaker, language, emotion, and age. Existing frame-level self-s…

    Submitted 10 August, 2022; originally announced August 2022.

    Comments: EARLY ACCESS of IEEE JSTSP Special Issue on Self-Supervised Learning for Speech and Audio Processing

  21. arXiv:2208.05413  [pdf, other]

    eess.AS cs.LG

    Non-Contrastive Self-Supervised Learning of Utterance-Level Speech Representations

    Authors: Jaejin Cho, Raghavendra Pappagari, Piotr Żelasko, Laureano Moro-Velazquez, Jesús Villalba, Najim Dehak

    Abstract: Considering the abundance of unlabeled speech data and the high labeling costs, unsupervised learning methods can be essential for better system development. One of the most successful methods is contrastive self-supervised methods, which require negative sampling: sampling alternative samples to contrast with the current sample (anchor). However, it is hard to ensure if all the negative samples b…

    Submitted 10 August, 2022; originally announced August 2022.

    Comments: Accepted at Interspeech 2022

  22. arXiv:2204.03851  [pdf, other]

    eess.AS cs.CR cs.SD

    Defense against Adversarial Attacks on Hybrid Speech Recognition using Joint Adversarial Fine-tuning with Denoiser

    Authors: Sonal Joshi, Saurabh Kataria, Yiwen Shao, Piotr Zelasko, Jesus Villalba, Sanjeev Khudanpur, Najim Dehak

    Abstract: Adversarial attacks are a threat to automatic speech recognition (ASR) systems, and it becomes imperative to propose defenses to protect them. In this paper, we perform experiments to show that K2 conformer hybrid ASR is strongly affected by white-box adversarial attacks. We propose three defenses--denoiser pre-processor, adversarially fine-tuning ASR model, and adversarially fine-tuning joint mod…

    Submitted 8 April, 2022; originally announced April 2022.

    Comments: Submitted to Interspeech 2022

  23. arXiv:2204.03848  [pdf, ps, other]

    eess.AS cs.CR cs.SD

    AdvEst: Adversarial Perturbation Estimation to Classify and Detect Adversarial Attacks against Speaker Identification

    Authors: Sonal Joshi, Saurabh Kataria, Jesus Villalba, Najim Dehak

    Abstract: Adversarial attacks pose a severe security threat to the state-of-the-art speaker identification systems, thereby making it vital to propose countermeasures against them. Building on our previous work that used representation learning to classify and detect adversarial attacks, we propose an improvement to it using AdvEst, a method to estimate adversarial perturbation. First, we prove our claim th…

    Submitted 8 April, 2022; originally announced April 2022.

    Comments: Submitted to InterSpeech 2022

  24. arXiv:2203.16614  [pdf, other]

    eess.AS cs.SD

    Joint domain adaptation and speech bandwidth extension using time-domain GANs for speaker verification

    Authors: Saurabh Kataria, Jesús Villalba, Laureano Moro-Velázquez, Najim Dehak

    Abstract: Speech systems developed for a particular choice of acoustic domain and sampling frequency do not translate easily to others. The usual practice is to learn domain adaptation and bandwidth extension models independently. Contrary to this, we propose to learn both tasks together. Particularly, we learn to map narrowband conversational telephone speech to wideband microphone speech. We developed par…

    Submitted 30 March, 2022; originally announced March 2022.

    Comments: submitted to Interspeech 2022

  25. arXiv:2201.05169  [pdf, other]

    astro-ph.GA astro-ph.CO astro-ph.HE

    The eROSITA Final Equatorial Depth Survey (eFEDS): X-ray emission around star-forming and quiescent galaxies at $0.05<z<0.3$

    Authors: Johan Comparat, Nhut Truong, Andrea Merloni, Annalisa Pillepich, Gabriele Ponti, Simon Driver, Sabine Bellstedt, Joe Liske, James Aird, Marcus Brüggen, Esra Bulbul, Luke Davies, Justo Antonio González Villalba, Antonis Georgakakis, Frank Haberl, Teng Liu, Chandreyee Maitra, Kirpal Nandra, Paola Popesso, Peter Predehl, Aaron Robotham, Mara Salvato, Jessica E. Thorne, Yi Zhang

    Abstract: We aim at characterizing the hot phase of the Circum-Galactic Medium in a large sample of galaxies. We stack X-ray events from the SRG/eROSITA eFEDS survey around central galaxies in the GAMA 9hr field to construct radially projected soft X-ray luminosity profiles as a function of their stellar mass and specific star formation rate. We consider samples of quiescent (star-forming) galaxies in the s…

    Submitted 17 August, 2022; v1 submitted 13 January, 2022; originally announced January 2022.

    Comments: 23 pages, 11 figures, 4 tables, accepted in A&A

    Journal ref: A&A 666, A156 (2022)

  26. arXiv:2110.02345  [pdf, other]

    eess.AS cs.SD

    Unsupervised Speech Segmentation and Variable Rate Representation Learning using Segmental Contrastive Predictive Coding

    Authors: Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velazquez, Najim Dehak

    Abstract: Typically, unsupervised segmentation of speech into the phone and word-like units are treated as separate tasks and are often done via different methods which do not fully leverage the inter-dependence of the two tasks. Here, we unify them and propose a technique that can jointly perform both, showing that these two tasks indeed benefit from each other. Recent attempts employ self-supervised learn…

    Submitted 8 October, 2021; v1 submitted 5 October, 2021; originally announced October 2021.

    Comments: arXiv admin note: substantial text overlap with arXiv:2106.02170

  27. arXiv:2109.13425  [pdf, ps, other]

    eess.AS cs.LG cs.SD

    The JHU submission to VoxSRC-21: Track 3

    Authors: Jaejin Cho, Jesus Villalba, Najim Dehak

    Abstract: This technical report describes Johns Hopkins University speaker recognition system submitted to Voxceleb Speaker Recognition Challenge 2021 Track 3: Self-supervised speaker verification (closed). Our overall training process is similar to the proposed one from the first place team in the last year's VoxSRC2020 challenge. The main difference is a recently proposed non-contrastive self-supervised m…

    Submitted 27 September, 2021; originally announced September 2021.

  28. arXiv:2109.06112  [pdf, other]

    cs.CL cs.SD eess.AS

    Beyond Isolated Utterances: Conversational Emotion Recognition

    Authors: Raghavendra Pappagari, Piotr Żelasko, Jesús Villalba, Laureano Moro-Velazquez, Najim Dehak

    Abstract: Speech emotion recognition is the task of recognizing the speaker's emotional state given a recording of their utterance. While most of the current approaches focus on inferring emotion from isolated utterances, we argue that this is not sufficient to achieve conversational emotion recognition (CER) which deals with recognizing emotions in conversations. In this work, we propose several approaches…

    Submitted 13 September, 2021; originally announced September 2021.

    Comments: Accepted for ASRU 2021

  29. arXiv:2107.04448  [pdf, other]

    eess.AS

    Representation Learning to Classify and Detect Adversarial Attacks against Speaker and Speech Recognition Systems

    Authors: Jesús Villalba, Sonal Joshi, Piotr Żelasko, Najim Dehak

    Abstract: Adversarial attacks have become a major threat for machine learning applications. There is a growing interest in studying these attacks in the audio domain, e.g., speech and speaker recognition; and find defenses against them. In this work, we focus on using representation learning to classify/detect attacks w.r.t. the attack algorithm, threat model or signal-to-adversarial-noise ratio. We found th…

    Submitted 9 July, 2021; originally announced July 2021.

    Comments: Accepted at Interspeech 2021

  30. arXiv:2106.02170  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation

    Authors: Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velazquez, Najim Dehak

    Abstract: Automatic detection of phoneme or word-like units is one of the core objectives in zero-resource speech processing. Recent attempts employ self-supervised training methods, such as contrastive predictive coding (CPC), where the next frame is predicted given past context. However, CPC only looks at the audio signal's frame-level structure. We overcome this limitation with a segmental contrastive pr…

    Submitted 3 June, 2021; originally announced June 2021.

  31. arXiv:2104.01433  [pdf, other]

    eess.AS

    Deep Feature CycleGANs: Speaker Identity Preserving Non-parallel Microphone-Telephone Domain Adaptation for Speaker Verification

    Authors: Saurabh Kataria, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velázquez, Najim Dehak

    Abstract: With the increase in the availability of speech from varied domains, it is imperative to use such out-of-domain data to improve existing speech systems. Domain adaptation is a prominent pre-processing approach for this. We investigate it for adapting microphone speech to the telephone domain. Specifically, we explore CycleGAN-based unpaired translation of microphone data to improve the x-vector/speak…

    Submitted 3 April, 2021; originally announced April 2021.

  32. arXiv:2103.17122  [pdf, ps, other]

    eess.AS cs.CR cs.SD

    Adversarial Attacks and Defenses for Speech Recognition Systems

    Authors: Piotr Żelasko, Sonal Joshi, Yiwen Shao, Jesus Villalba, Jan Trmal, Najim Dehak, Sanjeev Khudanpur

    Abstract: The ubiquitous presence of machine learning systems in our lives necessitates research into their vulnerabilities and appropriate countermeasures. In particular, we investigate the effectiveness of adversarial attacks and defenses against automatic speech recognition (ASR) systems. We select two ASR models - a thoroughly studied DeepSpeech model and a more recent Espresso framework Transformer enc…

    Submitted 31 March, 2021; originally announced March 2021.

    Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  33. arXiv:2101.08909  [pdf, other]

    eess.AS cs.SD

    Study of Pre-processing Defenses against Adversarial Attacks on State-of-the-art Speaker Recognition Systems

    Authors: Sonal Joshi, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velázquez, Najim Dehak

    Abstract: Adversarial examples to speaker recognition (SR) systems are generated by adding a carefully crafted noise to the speech signal to make the system fail while being imperceptible to humans. Such attacks pose severe security risks, making it vital to deep-dive and understand how much the state-of-the-art SR systems are vulnerable to these attacks. Moreover, it is of greater importance to propose def…

    Submitted 25 June, 2021; v1 submitted 21 January, 2021; originally announced January 2021.

    Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  34. arXiv:2011.02090  [pdf, other]

    eess.AS cs.SD

    Frustratingly Easy Noise-aware Training of Acoustic Models

    Authors: Desh Raj, Jesus Villalba, Daniel Povey, Sanjeev Khudanpur

    Abstract: Environmental noises and reverberation have a detrimental effect on the performance of automatic speech recognition (ASR) systems. Multi-condition training of neural network-based acoustic models is used to deal with this problem, but it requires many-folds data augmentation, resulting in increased training time. In this paper, we propose utterance-level noise vectors for noise-aware training of a…

    Submitted 2 February, 2021; v1 submitted 3 November, 2020; originally announced November 2020.

    Comments: 6 + 3 (Appendix) pages

  35. arXiv:2011.01210  [pdf, other]

    eess.AS cs.LG

    Focus on the present: a regularization method for the ASR source-target attention layer

    Authors: Nanxin Chen, Piotr Żelasko, Jesús Villalba, Najim Dehak

    Abstract: This paper introduces a novel method to diagnose the source-target attention in state-of-the-art end-to-end speech recognition models with joint connectionist temporal classification (CTC) and attention training. Our method is based on the fact that both, CTC and source-target attention, are acting on the same encoder representations. To understand the functionality of the attention, CTC is applie…

    Submitted 2 November, 2020; originally announced November 2020.

    Comments: submitted to ICASSP2021. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

  36. arXiv:2010.14602  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    CopyPaste: An Augmentation Method for Speech Emotion Recognition

    Authors: Raghavendra Pappagari, Jesús Villalba, Piotr Żelasko, Laureano Moro-Velazquez, Najim Dehak

    Abstract: Data augmentation is a widely used strategy for training robust machine learning models. It partially alleviates the problem of limited data for tasks like speech emotion recognition (SER), where collecting data is expensive and challenging. This study proposes CopyPaste, a perceptually motivated novel augmentation procedure for SER. Assuming that the presence of emotions other than neutral dictat…

    Submitted 11 February, 2021; v1 submitted 27 October, 2020; originally announced October 2020.

    Comments: Accepted at ICASSP2021

  37. arXiv:2010.11860  [pdf, other]

    eess.AS cs.SD

    Perceptual Loss based Speech Denoising with an ensemble of Audio Pattern Recognition and Self-Supervised Models

    Authors: Saurabh Kataria, Jesús Villalba, Najim Dehak

    Abstract: Deep learning based speech denoising still suffers from the challenge of improving perceptual quality of enhanced signals. We introduce a generalized framework called Perceptual Ensemble Regularization Loss (PERL) built on the idea of perceptual losses. Perceptual loss discourages distortion to certain speech properties and we analyze it using six large-scale pre-trained models: speaker classifica…

    Submitted 22 October, 2020; originally announced October 2020.

  38. arXiv:2010.11221  [pdf, other]

    eess.AS cs.LG cs.SD

    Learning Speaker Embedding from Text-to-Speech

    Authors: Jaejin Cho, Piotr Zelasko, Jesus Villalba, Shinji Watanabe, Najim Dehak

    Abstract: Zero-shot multi-speaker Text-to-Speech (TTS) generates target speaker voices given an input text and the corresponding speaker embedding. In this work, we investigate the effectiveness of the TTS reconstruction objective to improve representation learning for speaker verification. We jointly trained end-to-end Tacotron 2 TTS and speaker embedding networks in a self-supervised fashion. We hypothesi…

    Submitted 21 October, 2020; originally announced October 2020.

  39. arXiv:2007.13033  [pdf, other]

    eess.AS cs.LG cs.SD

    Self-Expressing Autoencoders for Unsupervised Spoken Term Discovery

    Authors: Saurabhchand Bhati, Jesús Villalba, Piotr Żelasko, Najim Dehak

    Abstract: Unsupervised spoken term discovery consists of two tasks: finding the acoustic segment boundaries and labeling acoustically similar segments with the same labels. We perform segmentation based on the assumption that the frame feature vectors are more similar within a segment than across the segments. Therefore, for strong segmentation performance, it is crucial that the features represent the phon…

    Submitted 25 July, 2020; originally announced July 2020.

  40. arXiv:2005.08331  [pdf, ps, other]

    eess.AS cs.SD

    Single Channel Far Field Feature Enhancement For Speaker Verification In The Wild

    Authors: Phani Sankar Nidadavolu, Saurabh Kataria, Paola García-Perera, Jesús Villalba, Najim Dehak

    Abstract: We investigated an enhancement and a domain adaptation approach to make speaker verification systems robust to perturbations of far-field speech. In the enhancement approach, using paired (parallel) reverberant-clean speech, we trained a supervised Generative Adversarial Network (GAN) along with a feature mapping loss. For the domain adaptation approach, we trained a Cycle Consistent Generative Ad…

    Submitted 17 May, 2020; originally announced May 2020.

    Comments: submitted to INTERSPEECH 2020

  41. arXiv:2002.05039  [pdf, ps, other

    eess.AS cs.LG cs.SD stat.ML

    x-vectors meet emotions: A study on dependencies between emotion and speaker recognition

    Authors: Raghavendra Pappagari, Tianzi Wang, Jesus Villalba, Nanxin Chen, Najim Dehak

    Abstract: In this work, we explore the dependencies between speaker recognition and emotion recognition. We first show that knowledge learned for speaker recognition can be reused for emotion recognition through transfer learning. Then, we show the effect of emotion on speaker recognition. For emotion recognition, we show that using a simple linear model is enough to obtain good performance on the features…

    Submitted 12 February, 2020; originally announced February 2020.

    Comments: 45th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020

  42. arXiv:2002.00139  [pdf, other

    eess.AS cs.SD

    Analysis of Deep Feature Loss based Enhancement for Speaker Verification

    Authors: Saurabh Kataria, Phani Sankar Nidadavolu, Jesús Villalba, Najim Dehak

    Abstract: Data augmentation is conventionally used to inject robustness in Speaker Verification systems. Several recently organized challenges focus on handling novel acoustic environments. Deep learning based speech enhancement is a modern solution for this. Recently, a study proposed to optimize the enhancement network in the activation space of a pre-trained auxiliary network. This methodology, called de…

    Submitted 27 April, 2020; v1 submitted 31 January, 2020; originally announced February 2020.

    Comments: 8 pages; accepted in Odyssey2020 workshop

  43. arXiv:1912.00938  [pdf

    eess.AS cs.SD

    Speaker detection in the wild: Lessons learned from JSALT 2019

    Authors: Paola Garcia, Jesus Villalba, Herve Bredin, Jun Du, Diego Castan, Alejandrina Cristia, Latane Bullock, Ling Guo, Koji Okabe, Phani Sankar Nidadavolu, Saurabh Kataria, Sizhu Chen, Leo Galmant, Marvin Lavechin, Lei Sun, Marie-Philippe Gill, Bar Ben-Yair, Sajjad Abdoli, Xin Wang, Wassim Bouaziz, Hadrien Titeux, Emmanuel Dupoux, Kong Aik Lee, Najim Dehak

    Abstract: This paper presents the problems and solutions addressed at the JSALT workshop when using a single microphone for speaker detection in adverse scenarios. The main focus was to tackle a wide range of conditions that go from meetings to wild speech. We describe the research threads we explored and a set of modules that was successful for these scenarios. The ultimate goal was to explore speaker dete…

    Submitted 2 December, 2019; originally announced December 2019.

    Comments: Submitted to ICASSP 2020

  44. arXiv:1911.04908  [pdf, other

    eess.AS cs.LG cs.SD stat.ML

    Listen and Fill in the Missing Letters: Non-Autoregressive Transformer for Speech Recognition

    Authors: Nanxin Chen, Shinji Watanabe, Jesús Villalba, Najim Dehak

    Abstract: Recently, very deep transformers have outperformed conventional bi-directional long short-term memory networks by a large margin in speech recognition. However, to put them into production usage, inference computation cost is still a serious concern in real scenarios. In this paper, we study two different non-autoregressive transformer structures for automatic speech recognition (ASR): A-CMLM and A-FM…

    Submitted 6 April, 2020; v1 submitted 10 November, 2019; originally announced November 2019.

  45. arXiv:1911.00432  [pdf, other

    eess.AS

    Deep neural networks for emotion recognition combining audio and transcripts

    Authors: Jaejin Cho, Raghavendra Pappagari, Purva Kulkarni, Jesus Villalba, Yishay Carmiel, Najim Dehak

    Abstract: In this paper, we propose to improve emotion recognition by combining acoustic information and conversation transcripts. On the one hand, an LSTM network was used to detect emotion from acoustic features like f0, shimmer, jitter, MFCC, etc. On the other hand, a multi-resolution CNN was used to detect emotion from word sequences. This CNN consists of several parallel convolutions with different ker…

    Submitted 1 November, 2019; originally announced November 2019.

  46. arXiv:1910.11915  [pdf, ps, other

    eess.AS cs.SD

    Unsupervised Feature Enhancement for speaker verification

    Authors: Phani Sankar Nidadavolu, Saurabh Kataria, Jesús Villalba, Paola García-Perera, Najim Dehak

    Abstract: The task of making speaker verification systems robust to adverse scenarios remains a challenging and active area of research. We developed an unsupervised feature enhancement approach in the log-filter bank domain with the end goal of improving speaker verification performance. We experimented with using both real speech recorded in adverse environments and degraded speech obtained by simulation to…

    Submitted 14 February, 2020; v1 submitted 25 October, 2019; originally announced October 2019.

    Comments: 5 pages; accepted in ICASSP 2020

  47. arXiv:1910.11909  [pdf, other

    eess.AS cs.SD

    Low-Resource Domain Adaptation for Speaker Recognition Using Cycle-GANs

    Authors: Phani Sankar Nidadavolu, Saurabh Kataria, Jesús Villalba, Najim Dehak

    Abstract: Current speaker recognition technology provides great performance with the x-vector approach. However, performance decreases when the evaluation domain is different from the training domain, an issue usually addressed with domain adaptation approaches. Recently, unsupervised domain adaptation using cycle-consistent Generative Adversarial Networks (CycleGAN) has received a lot of attention. CycleGAN…

    Submitted 25 October, 2019; originally announced October 2019.

    Comments: 8 pages, accepted to ASRU 2019

  48. arXiv:1910.11905  [pdf, ps, other

    eess.AS cs.SD

    Feature Enhancement with Deep Feature Losses for Speaker Verification

    Authors: Saurabh Kataria, Phani Sankar Nidadavolu, Jesús Villalba, Nanxin Chen, Paola García, Najim Dehak

    Abstract: Speaker Verification still suffers from the challenge of generalization to novel adverse environments. We leverage the recent advances made by deep learning based speech enhancement and propose a feature-domain supervised denoising based solution. We propose to use Deep Feature Loss which optimizes the enhancement network in the hidden activation space of a pre-trained auxiliary speaker emb…

    Submitted 14 February, 2020; v1 submitted 25 October, 2019; originally announced October 2019.

    Comments: 5 pages, accepted in ICASSP 2020

  49. arXiv:1910.10781  [pdf, ps, other

    cs.CL cs.LG stat.ML

    Hierarchical Transformers for Long Document Classification

    Authors: Raghavendra Pappagari, Piotr Żelasko, Jesús Villalba, Yishay Carmiel, Najim Dehak

    Abstract: BERT, which stands for Bidirectional Encoder Representations from Transformers, is a recently introduced language representation model based upon the transfer learning paradigm. We extend its fine-tuning procedure to address one of its major limitations - applicability to inputs longer than a few hundred words, such as transcripts of human call conversations. Our method is conceptually simple. We…

    Submitted 23 October, 2019; originally announced October 2019.

    Comments: 4 figures, 7 pages

    Journal ref: Automatic Speech Recognition and Understanding Workshop, 2019

  50. arXiv:1904.01120  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    ASSERT: Anti-Spoofing with Squeeze-Excitation and Residual neTworks

    Authors: Cheng-I Lai, Nanxin Chen, Jesús Villalba, Najim Dehak

    Abstract: We present JHU's system submission to the ASVspoof 2019 Challenge: Anti-Spoofing with Squeeze-Excitation and Residual neTworks (ASSERT). Anti-spoofing has gathered more and more attention since the inauguration of the ASVspoof Challenges, and ASVspoof 2019 is dedicated to addressing attacks from all three major types: text-to-speech, voice conversion, and replay. Built upon previous research work on Dee…

    Submitted 1 April, 2019; originally announced April 2019.

    Comments: Submitted to Interspeech 2019, Graz, Austria