Showing 1–19 of 19 results for author: Likhomanenko, T

  1. arXiv:2405.15216

    cs.LG cs.CL cs.SD eess.AS

    Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition

    Authors: Zijin Gu, Tatiana Likhomanenko, He Bai, Erik McDermott, Ronan Collobert, Navdeep Jaitly

    Abstract: Language models (LMs) have long been used to improve results of automatic speech recognition (ASR) systems, but they are unaware of the errors that ASR systems make. Error correction models are designed to fix ASR errors, however, they showed little improvement over traditional LMs mainly due to the lack of supervised training data. In this paper, we present Denoising LM (DLM), which is a…

    Submitted 24 May, 2024; originally announced May 2024.

    Comments: under review

  2. arXiv:2402.00340

    cs.SD eess.AS

    Can you Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features?

    Authors: Zakaria Aldeneh, Takuya Higuchi, Jee-weon Jung, Skyler Seto, Tatiana Likhomanenko, Stephen Shum, Ahmed Hussen Abdelaziz, Shinji Watanabe, Barry-John Theobald

    Abstract: Self-supervised features are typically used in place of filter-bank features in speaker verification models. However, these models were originally designed to ingest filter-bank features as inputs, and thus, training them on top of self-supervised features assumes that both feature types require the same amount of learning for the task. In this work, we observe that pre-trained self-supervised spe…

    Submitted 13 June, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

  3. arXiv:2309.17395

    cs.LG cs.SD eess.AS stat.ML

    AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition

    Authors: Andrew Rouditchenko, Ronan Collobert, Tatiana Likhomanenko

    Abstract: Audio-visual speech contains synchronized audio and visual information that provides cross-modal supervision to learn representations for both automatic speech recognition (ASR) and visual speech recognition (VSR). We introduce continuous pseudo-labeling for audio-visual speech recognition (AV-CPL), a semi-supervised method to train an audio-visual speech recognition (AVSR) model on a combination…

    Submitted 29 September, 2023; originally announced September 2023.

    Comments: Under review

  4. arXiv:2309.13102

    eess.AS cs.DC cs.LG cs.SD

    Importance of Smoothness Induced by Optimizers in FL4ASR: Towards Understanding Federated Learning for End-to-End ASR

    Authors: Sheikh Shams Azam, Tatiana Likhomanenko, Martin Pelikan, Jan "Honza" Silovsky

    Abstract: In this paper, we start by training End-to-End Automatic Speech Recognition (ASR) models using Federated Learning (FL) and examining the fundamental considerations that can be pivotal in minimizing the performance gap in terms of word error rate between models trained using FL versus their centralized counterpart. Specifically, we study the effect of (i) adaptive optimizers, (ii) loss characterist…

    Submitted 22 September, 2023; originally announced September 2023.

    Comments: In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2023

  5. arXiv:2305.13330

    eess.AS cs.CL cs.LG cs.SD

    Unsupervised ASR via Cross-Lingual Pseudo-Labeling

    Authors: Tatiana Likhomanenko, Loren Lugosch, Ronan Collobert

    Abstract: Recent work has shown that it is possible to train an $\textit{unsupervised}$ automatic speech recognition (ASR) system using only unpaired audio and text. Existing unsupervised ASR methods assume that no labeled data can be used for training. We argue that even if one does not have any labeled audio for a given language, there is $\textit{always}$ labeled data available for other languages. We sh…

    Submitted 16 February, 2024; v1 submitted 18 May, 2023; originally announced May 2023.

  6. arXiv:2212.09982

    cs.CL cs.SD eess.AS

    Joint Speech Transcription and Translation: Pseudo-Labeling with Out-of-Distribution Data

    Authors: Mozhdeh Gheini, Tatiana Likhomanenko, Matthias Sperber, Hendra Setiawan

    Abstract: Self-training has been shown to be helpful in addressing data scarcity for many domains, including vision, speech, and language. Specifically, self-training, or pseudo-labeling, labels unsupervised data and adds that to the training pool. In this work, we investigate and use pseudo-labeling for a recently proposed novel setup: joint transcription and translation of speech, which suffers from an ab…

    Submitted 19 December, 2022; originally announced December 2022.

  7. arXiv:2211.06007

    cs.LG cs.SD eess.AS stat.ML

    Continuous Soft Pseudo-Labeling in ASR

    Authors: Tatiana Likhomanenko, Ronan Collobert, Navdeep Jaitly, Samy Bengio

    Abstract: Continuous pseudo-labeling (PL) algorithms such as slimIPL have recently emerged as a powerful strategy for semi-supervised learning in speech recognition. In contrast with earlier strategies that alternated between training a model and generating pseudo-labels (PLs) with it, here PLs are generated in end-to-end manner as training proceeds, improving training speed and the accuracy of the final mo…

    Submitted 30 January, 2023; v1 submitted 11 November, 2022; originally announced November 2022.

  8. arXiv:2211.00854

    cs.LG cs.SD eess.AS

    More Speaking or More Speakers?

    Authors: Dan Berrebbi, Ronan Collobert, Navdeep Jaitly, Tatiana Likhomanenko

    Abstract: Self-training (ST) and self-supervised learning (SSL) methods have demonstrated strong improvements in automatic speech recognition (ASR). In spite of these advances, to the best of our knowledge, there is no analysis of how the composition of the labelled and unlabelled datasets used in these methods affects the results. In this work we aim to analyse the effect of number of speakers in the train…

    Submitted 2 March, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

    Comments: ICASSP 2023

  9. arXiv:2207.07611

    cs.LG cs.CV cs.SD eess.AS

    Position Prediction as an Effective Pretraining Strategy

    Authors: Shuangfei Zhai, Navdeep Jaitly, Jason Ramapuram, Dan Busbridge, Tatiana Likhomanenko, Joseph Yitan Cheng, Walter Talbott, Chen Huang, Hanlin Goh, Joshua Susskind

    Abstract: Transformers have gained increasing popularity in a wide range of applications, including Natural Language Processing (NLP), Computer Vision and Speech Recognition, because of their powerful representational capacity. However, harnessing this representational capacity effectively requires a large amount of data, strong regularization, or both, to mitigate overfitting. Recently, the power of the Tr…

    Submitted 15 July, 2022; originally announced July 2022.

    Comments: Accepted to ICML 2022

  10. arXiv:2111.00161

    cs.CL cs.SD eess.AS

    Pseudo-Labeling for Massively Multilingual Speech Recognition

    Authors: Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, Ronan Collobert

    Abstract: Semi-supervised learning through pseudo-labeling has become a staple of state-of-the-art monolingual speech recognition systems. In this work, we extend pseudo-labeling to massively multilingual speech recognition with 60 languages. We propose a simple pseudo-labeling recipe that works well even with low-resource languages: train a supervised multilingual model, fine-tune it with semi-supervised l…

    Submitted 8 March, 2022; v1 submitted 29 October, 2021; originally announced November 2021.

    Comments: Accepted to ICASSP 2022. New version has links to code/models + more training curves for larger model. (Fixed code link.)

  11. arXiv:2110.05994

    eess.AS cs.CL cs.SD

    Word Order Does Not Matter For Speech Recognition

    Authors: Vineel Pratap, Qiantong Xu, Tatiana Likhomanenko, Gabriel Synnaeve, Ronan Collobert

    Abstract: In this paper, we study training of automatic speech recognition system in a weakly supervised setting where the order of words in transcript labels of the audio training data is not known. We train a word-level acoustic model which aggregates the distribution of all output frames using LogSumExp operation and uses a cross-entropy loss to match with the ground-truth words distribution. Using the p…

    Submitted 18 October, 2021; v1 submitted 12 October, 2021; originally announced October 2021.

  12. arXiv:2106.07759

    eess.AS cs.CL

    Kaizen: Continuously improving teacher using Exponential Moving Average for semi-supervised speech recognition

    Authors: Vimal Manohar, Tatiana Likhomanenko, Qiantong Xu, Wei-Ning Hsu, Ronan Collobert, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed

    Abstract: In this paper, we introduce the Kaizen framework that uses a continuously improving teacher to generate pseudo-labels for semi-supervised speech recognition (ASR). The proposed approach uses a teacher model which is updated as the exponential moving average (EMA) of the student model parameters. We demonstrate that it is critical for EMA to be accumulated with full-precision floating point. The Ka…

    Submitted 27 October, 2021; v1 submitted 14 June, 2021; originally announced June 2021.

    Comments: Updated with camera ready version

  13. arXiv:2104.01027

    cs.SD cs.CL cs.LG eess.AS

    Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training

    Authors: Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve, Michael Auli

    Abstract: Self-supervised learning of speech representations has been a very active research area but most work is focused on a single domain such as read audio books for which there exist large quantities of labeled and unlabeled data. In this paper, we explore more general setups where the domain of the unlabeled data for pre-training data differs from the domain of the labeled data for fine-tuning, which…

    Submitted 8 September, 2021; v1 submitted 2 April, 2021; originally announced April 2021.

  14. arXiv:2010.11745

    cs.LG cs.CL cs.SD eess.AS

    Rethinking Evaluation in ASR: Are Our Models Robust Enough?

    Authors: Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Paden Tomasello, Jacob Kahn, Gilad Avidov, Ronan Collobert, Gabriel Synnaeve

    Abstract: Is pushing numbers on a single benchmark valuable in automatic speech recognition? Research results in acoustic modeling are typically evaluated based on performance on a single dataset. While the research community has coalesced around various benchmarks, we set out to understand generalization performance in acoustic modeling across datasets - in particular, if models trained on a single dataset…

    Submitted 2 May, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

    MSC Class: 68T07; 68T10 ACM Class: I.2.6; I.5.4

  15. arXiv:2010.11430

    cs.LG cs.SD eess.AS

    Self-training and Pre-training are Complementary for Speech Recognition

    Authors: Qiantong Xu, Alexei Baevski, Tatiana Likhomanenko, Paden Tomasello, Alexis Conneau, Ronan Collobert, Gabriel Synnaeve, Michael Auli

    Abstract: Self-training and unsupervised pre-training have emerged as effective approaches to improve speech recognition systems using unlabeled data. However, it is not clear whether they learn similar patterns or if they can be effectively combined. In this paper, we show that pseudo-labeling and pre-training with wav2vec 2.0 are complementary in a variety of labeled data setups. Using just 10 minutes of…

    Submitted 22 October, 2020; originally announced October 2020.

  16. arXiv:2005.09267

    cs.CL cs.SD eess.AS

    Iterative Pseudo-Labeling for Speech Recognition

    Authors: Qiantong Xu, Tatiana Likhomanenko, Jacob Kahn, Awni Hannun, Gabriel Synnaeve, Ronan Collobert

    Abstract: Pseudo-labeling has recently shown promise in end-to-end automatic speech recognition (ASR). We study Iterative Pseudo-Labeling (IPL), a semi-supervised algorithm which efficiently performs multiple iterations of pseudo-labeling on unlabeled data as the acoustic model evolves. In particular, IPL fine-tunes an existing model at each iteration using both labeled data and a subset of unlabeled data.…

    Submitted 26 August, 2020; v1 submitted 19 May, 2020; originally announced May 2020.

    Comments: INTERSPEECH 2020

  17. arXiv:2001.09727

    cs.CL cs.SD eess.AS

    Scaling Up Online Speech Recognition Using ConvNets

    Authors: Vineel Pratap, Qiantong Xu, Jacob Kahn, Gilad Avidov, Tatiana Likhomanenko, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan Collobert

    Abstract: We design an online end-to-end speech recognition system based on Time-Depth Separable (TDS) convolutions and Connectionist Temporal Classification (CTC). We improve the core TDS architecture in order to limit the future context and hence reduce latency while maintaining accuracy. The system has almost three times the throughput of a well tuned hybrid ASR baseline while also having lower latency a…

    Submitted 27 January, 2020; originally announced January 2020.

  18. Libri-Light: A Benchmark for ASR with Limited or No Supervision

    Authors: Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen, Tatiana Likhomanenko, Gabriel Synnaeve, Armand Joulin, Abdelrahman Mohamed, Emmanuel Dupoux

    Abstract: We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR…

    Submitted 17 December, 2019; originally announced December 2019.

  19. arXiv:1911.08460

    cs.CL cs.SD eess.AS

    End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures

    Authors: Gabriel Synnaeve, Qiantong Xu, Jacob Kahn, Tatiana Likhomanenko, Edouard Grave, Vineel Pratap, Anuroop Sriram, Vitaliy Liptchinsky, Ronan Collobert

    Abstract: We study pseudo-labeling for the semi-supervised training of ResNet, Time-Depth Separable ConvNets, and Transformers for speech recognition, with either CTC or Seq2Seq loss functions. We perform experiments on the standard LibriSpeech dataset, and leverage additional unlabeled data from LibriVox through pseudo-labeling. We show that while Transformer-based acoustic models have superior performance…

    Submitted 14 July, 2020; v1 submitted 19 November, 2019; originally announced November 2019.

    Comments: Published at the workshop on Self-supervision in Audio and Speech (SAS) at the 37th International Conference on Machine Learning (ICML 2020), Vienna, Austria