Showing 1–50 of 54 results for author: Collobert, R

Searching in archive cs.
  1. arXiv:2405.15216  [pdf, other]

    cs.LG cs.CL cs.SD eess.AS

    Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition

    Authors: Zijin Gu, Tatiana Likhomanenko, He Bai, Erik McDermott, Ronan Collobert, Navdeep Jaitly

    Abstract: Language models (LMs) have long been used to improve results of automatic speech recognition (ASR) systems, but they are unaware of the errors that ASR systems make. Error correction models are designed to fix ASR errors; however, they showed little improvement over traditional LMs mainly due to the lack of supervised training data. In this paper, we present Denoising LM (DLM), which is a…

    Submitted 24 May, 2024; originally announced May 2024.

    Comments: under review

  2. arXiv:2309.17395  [pdf, other]

    cs.LG cs.SD eess.AS stat.ML

    AV-CPL: Continuous Pseudo-Labeling for Audio-Visual Speech Recognition

    Authors: Andrew Rouditchenko, Ronan Collobert, Tatiana Likhomanenko

    Abstract: Audio-visual speech contains synchronized audio and visual information that provides cross-modal supervision to learn representations for both automatic speech recognition (ASR) and visual speech recognition (VSR). We introduce continuous pseudo-labeling for audio-visual speech recognition (AV-CPL), a semi-supervised method to train an audio-visual speech recognition (AVSR) model on a combination…

    Submitted 29 September, 2023; originally announced September 2023.

    Comments: Under review

  3. arXiv:2305.13330  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Unsupervised ASR via Cross-Lingual Pseudo-Labeling

    Authors: Tatiana Likhomanenko, Loren Lugosch, Ronan Collobert

    Abstract: Recent work has shown that it is possible to train an $\textit{unsupervised}$ automatic speech recognition (ASR) system using only unpaired audio and text. Existing unsupervised ASR methods assume that no labeled data can be used for training. We argue that even if one does not have any labeled audio for a given language, there is $\textit{always}$ labeled data available for other languages. We sh…

    Submitted 16 February, 2024; v1 submitted 18 May, 2023; originally announced May 2023.

  4. arXiv:2211.06007  [pdf, other]

    cs.LG cs.SD eess.AS stat.ML

    Continuous Soft Pseudo-Labeling in ASR

    Authors: Tatiana Likhomanenko, Ronan Collobert, Navdeep Jaitly, Samy Bengio

    Abstract: Continuous pseudo-labeling (PL) algorithms such as slimIPL have recently emerged as a powerful strategy for semi-supervised learning in speech recognition. In contrast with earlier strategies that alternated between training a model and generating pseudo-labels (PLs) with it, here PLs are generated in end-to-end manner as training proceeds, improving training speed and the accuracy of the final mo…

    Submitted 30 January, 2023; v1 submitted 11 November, 2022; originally announced November 2022.

  5. arXiv:2211.00854  [pdf, other]

    cs.LG cs.SD eess.AS

    More Speaking or More Speakers?

    Authors: Dan Berrebbi, Ronan Collobert, Navdeep Jaitly, Tatiana Likhomanenko

    Abstract: Self-training (ST) and self-supervised learning (SSL) methods have demonstrated strong improvements in automatic speech recognition (ASR). In spite of these advances, to the best of our knowledge, there is no analysis of how the composition of the labelled and unlabelled datasets used in these methods affects the results. In this work we aim to analyse the effect of number of speakers in the train…

    Submitted 2 March, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

    Comments: ICASSP 2023

  6. arXiv:2210.08711  [pdf, other]

    cs.LG

    Continuous Pseudo-Labeling from the Start

    Authors: Dan Berrebbi, Ronan Collobert, Samy Bengio, Navdeep Jaitly, Tatiana Likhomanenko

    Abstract: Self-training (ST), or pseudo-labeling, has sparked significant interest in the automatic speech recognition (ASR) community recently because of its success in harnessing unlabeled data. Unlike prior semi-supervised learning approaches that relied on iteratively regenerating pseudo-labels (PLs) from a trained model and using them to train a new model, recent state-of-the-art methods perform `contin…

    Submitted 7 April, 2023; v1 submitted 16 October, 2022; originally announced October 2022.

    Comments: To appear in ICLR 2023

  7. arXiv:2201.12465  [pdf, other]

    cs.LG cs.AI cs.DC

    Flashlight: Enabling Innovation in Tools for Machine Learning

    Authors: Jacob Kahn, Vineel Pratap, Tatiana Likhomanenko, Qiantong Xu, Awni Hannun, Jeff Cai, Paden Tomasello, Ann Lee, Edouard Grave, Gilad Avidov, Benoit Steiner, Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan Collobert

    Abstract: As the computational requirements for machine learning systems and the size and complexity of machine learning frameworks increases, essential framework innovation has become challenging. While computational needs have driven recent compiler, networking, and hardware advancements, utilization of those advancements by machine learning tools is occurring at a slower pace. This is in part due to the…

    Submitted 22 June, 2022; v1 submitted 28 January, 2022; originally announced January 2022.

    Comments: Presented at ICML 2022

  8. arXiv:2201.12208  [pdf, other]

    cs.LG cs.SD eess.AS stat.ML

    Star Temporal Classification: Sequence Classification with Partially Labeled Data

    Authors: Vineel Pratap, Awni Hannun, Gabriel Synnaeve, Ronan Collobert

    Abstract: We develop an algorithm which can learn from partially labeled and unsegmented sequential data. Most sequential loss functions, such as Connectionist Temporal Classification (CTC), break down when many labels are missing. We address this problem with Star Temporal Classification (STC) which uses a special star token to allow alignments which include all possible tokens whenever a token could be mi…

    Submitted 3 March, 2022; v1 submitted 28 January, 2022; originally announced January 2022.

  9. arXiv:2111.00161  [pdf, other]

    cs.CL cs.SD eess.AS

    Pseudo-Labeling for Massively Multilingual Speech Recognition

    Authors: Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, Ronan Collobert

    Abstract: Semi-supervised learning through pseudo-labeling has become a staple of state-of-the-art monolingual speech recognition systems. In this work, we extend pseudo-labeling to massively multilingual speech recognition with 60 languages. We propose a simple pseudo-labeling recipe that works well even with low-resource languages: train a supervised multilingual model, fine-tune it with semi-supervised l…

    Submitted 8 March, 2022; v1 submitted 29 October, 2021; originally announced November 2021.

    Comments: Accepted to ICASSP 2022. New version has links to code/models + more training curves for larger model. (Fixed code link.)

  10. arXiv:2110.05994  [pdf, other]

    eess.AS cs.CL cs.SD

    Word Order Does Not Matter For Speech Recognition

    Authors: Vineel Pratap, Qiantong Xu, Tatiana Likhomanenko, Gabriel Synnaeve, Ronan Collobert

    Abstract: In this paper, we study training of automatic speech recognition system in a weakly supervised setting where the order of words in transcript labels of the audio training data is not known. We train a word-level acoustic model which aggregates the distribution of all output frames using LogSumExp operation and uses a cross-entropy loss to match with the ground-truth words distribution. Using the p…

    Submitted 18 October, 2021; v1 submitted 12 October, 2021; originally announced October 2021.

  11. arXiv:2106.07759  [pdf, ps, other]

    eess.AS cs.CL

    Kaizen: Continuously improving teacher using Exponential Moving Average for semi-supervised speech recognition

    Authors: Vimal Manohar, Tatiana Likhomanenko, Qiantong Xu, Wei-Ning Hsu, Ronan Collobert, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed

    Abstract: In this paper, we introduce the Kaizen framework that uses a continuously improving teacher to generate pseudo-labels for semi-supervised speech recognition (ASR). The proposed approach uses a teacher model which is updated as the exponential moving average (EMA) of the student model parameters. We demonstrate that it is critical for EMA to be accumulated with full-precision floating point. The Ka…

    Submitted 27 October, 2021; v1 submitted 14 June, 2021; originally announced June 2021.

    Comments: Updated with camera ready version

  12. arXiv:2106.03143  [pdf, other]

    cs.LG cs.CL cs.CV

    CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings

    Authors: Tatiana Likhomanenko, Qiantong Xu, Gabriel Synnaeve, Ronan Collobert, Alex Rogozhnikov

    Abstract: Without positional information, attention-based Transformer neural networks are permutation-invariant. Absolute or relative positional embeddings are the most popular ways to feed Transformer models with positional information. Absolute positional embeddings are simple to implement, but suffer from generalization issues when evaluating on sequences longer than seen at training time. Relative posit…

    Submitted 8 November, 2021; v1 submitted 6 June, 2021; originally announced June 2021.

  13. arXiv:2104.01027  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training

    Authors: Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve, Michael Auli

    Abstract: Self-supervised learning of speech representations has been a very active research area but most work is focused on a single domain such as read audio books for which there exist large quantities of labeled and unlabeled data. In this paper, we explore more general setups where the domain of the unlabeled data for pre-training data differs from the domain of the labeled data for fine-tuning, which…

    Submitted 8 September, 2021; v1 submitted 2 April, 2021; originally announced April 2021.

  14. MLS: A Large-Scale Multilingual Dataset for Speech Research

    Authors: Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, Ronan Collobert

    Abstract: This paper introduces Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages, including about 44.5K hours of English and a total of about 6K hours for other languages. Additionally, we provide Language Models (LM) and baseline Automatic Speech Recognition (ASR) models an…

    Submitted 19 December, 2020; v1 submitted 6 December, 2020; originally announced December 2020.

    Journal ref: Interspeech 2020

  15. arXiv:2011.00093  [pdf, other]

    cs.CL cs.LG cs.SD

    Joint Masked CPC and CTC Training for ASR

    Authors: Chaitanya Talnikar, Tatiana Likhomanenko, Ronan Collobert, Gabriel Synnaeve

    Abstract: Self-supervised learning (SSL) has shown promise in learning representations of audio that are useful for automatic speech recognition (ASR). But, training SSL models like wav2vec 2.0 requires a two-stage pipeline. In this paper we demonstrate a single-stage training of ASR models that can utilize both unlabeled and labeled data. During training, we alternately minimize two losses: an unsupervised…

    Submitted 13 February, 2021; v1 submitted 30 October, 2020; originally announced November 2020.

    Comments: ICASSP 2021

  16. arXiv:2010.11745  [pdf, ps, other]

    cs.LG cs.CL cs.SD eess.AS

    Rethinking Evaluation in ASR: Are Our Models Robust Enough?

    Authors: Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Paden Tomasello, Jacob Kahn, Gilad Avidov, Ronan Collobert, Gabriel Synnaeve

    Abstract: Is pushing numbers on a single benchmark valuable in automatic speech recognition? Research results in acoustic modeling are typically evaluated based on performance on a single dataset. While the research community has coalesced around various benchmarks, we set out to understand generalization performance in acoustic modeling across datasets - in particular, if models trained on a single dataset…

    Submitted 2 May, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

    MSC Class: 68T07; 68T10 ACM Class: I.2.6; I.5.4

  17. arXiv:2010.11524  [pdf, other]

    cs.CL cs.LG

    SlimIPL: Language-Model-Free Iterative Pseudo-Labeling

    Authors: Tatiana Likhomanenko, Qiantong Xu, Jacob Kahn, Gabriel Synnaeve, Ronan Collobert

    Abstract: Recent results in end-to-end automatic speech recognition have demonstrated the efficacy of pseudo-labeling for semi-supervised models trained both with Connectionist Temporal Classification (CTC) and Sequence-to-Sequence (seq2seq) losses. Iterative Pseudo-Labeling (IPL), which continuously trains a single model using pseudo-labels iteratively re-generated as the model learns, has been shown to fu…

    Submitted 29 August, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

  18. arXiv:2010.11430  [pdf, other]

    cs.LG cs.SD eess.AS

    Self-training and Pre-training are Complementary for Speech Recognition

    Authors: Qiantong Xu, Alexei Baevski, Tatiana Likhomanenko, Paden Tomasello, Alexis Conneau, Ronan Collobert, Gabriel Synnaeve, Michael Auli

    Abstract: Self-training and unsupervised pre-training have emerged as effective approaches to improve speech recognition systems using unlabeled data. However, it is not clear whether they learn similar patterns or if they can be effectively combined. In this paper, we show that pseudo-labeling and pre-training with wav2vec 2.0 are complementary in a variety of labeled data setups. Using just 10 minutes of…

    Submitted 22 October, 2020; originally announced October 2020.

  19. arXiv:2007.03001  [pdf, other]

    eess.AS cs.CL cs.SD

    Massively Multilingual ASR: 50 Languages, 1 Model, 1 Billion Parameters

    Authors: Vineel Pratap, Anuroop Sriram, Paden Tomasello, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan Collobert

    Abstract: We study training a single acoustic model for multiple languages with the aim of improving automatic speech recognition (ASR) performance on low-resource languages, and over-all simplifying deployment of ASR systems that support diverse languages. We perform an extensive benchmark on 51 languages, with varying amount of training data by language (from 100 hours to 1100 hours). We compare three vari…

    Submitted 7 July, 2020; v1 submitted 6 July, 2020; originally announced July 2020.

  20. arXiv:2006.13979  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Unsupervised Cross-lingual Representation Learning for Speech Recognition

    Authors: Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli

    Abstract: This paper presents XLSR which learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages. We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations and jointly learns a quantization of the latents shared across languages. The resulting model is fine-tuned on labeled data and…

    Submitted 15 December, 2020; v1 submitted 24 June, 2020; originally announced June 2020.

  21. arXiv:2005.09267  [pdf, other]

    cs.CL cs.SD eess.AS

    Iterative Pseudo-Labeling for Speech Recognition

    Authors: Qiantong Xu, Tatiana Likhomanenko, Jacob Kahn, Awni Hannun, Gabriel Synnaeve, Ronan Collobert

    Abstract: Pseudo-labeling has recently shown promise in end-to-end automatic speech recognition (ASR). We study Iterative Pseudo-Labeling (IPL), a semi-supervised algorithm which efficiently performs multiple iterations of pseudo-labeling on unlabeled data as the acoustic model evolves. In particular, IPL fine-tunes an existing model at each iteration using both labeled data and a subset of unlabeled data.…

    Submitted 26 August, 2020; v1 submitted 19 May, 2020; originally announced May 2020.

    Comments: INTERSPEECH 2020

  22. arXiv:2005.00581  [pdf, other]

    cs.CL cs.LG

    Multi-scale Transformer Language Models

    Authors: Sandeep Subramanian, Ronan Collobert, Marc'Aurelio Ranzato, Y-Lan Boureau

    Abstract: We investigate multi-scale transformer language models that learn representations of text at multiple scales, and present three different architectures that have an inductive bias to handle the hierarchical nature of language. Experiments on large-scale language modeling benchmarks empirically demonstrate favorable likelihood vs memory footprint trade-offs, e.g. we show that it is possible to trai…

    Submitted 1 May, 2020; originally announced May 2020.

  23. arXiv:2001.09727  [pdf, other]

    cs.CL cs.SD eess.AS

    Scaling Up Online Speech Recognition Using ConvNets

    Authors: Vineel Pratap, Qiantong Xu, Jacob Kahn, Gilad Avidov, Tatiana Likhomanenko, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan Collobert

    Abstract: We design an online end-to-end speech recognition system based on Time-Depth Separable (TDS) convolutions and Connectionist Temporal Classification (CTC). We improve the core TDS architecture in order to limit the future context and hence reduce latency while maintaining accuracy. The system has almost three times the throughput of a well tuned hybrid ASR baseline while also having lower latency a…

    Submitted 27 January, 2020; originally announced January 2020.

  24. Libri-Light: A Benchmark for ASR with Limited or No Supervision

    Authors: Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen, Tatiana Likhomanenko, Gabriel Synnaeve, Armand Joulin, Abdelrahman Mohamed, Emmanuel Dupoux

    Abstract: We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR…

    Submitted 17 December, 2019; originally announced December 2019.

  25. arXiv:1911.08460  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures

    Authors: Gabriel Synnaeve, Qiantong Xu, Jacob Kahn, Tatiana Likhomanenko, Edouard Grave, Vineel Pratap, Anuroop Sriram, Vitaliy Liptchinsky, Ronan Collobert

    Abstract: We study pseudo-labeling for the semi-supervised training of ResNet, Time-Depth Separable ConvNets, and Transformers for speech recognition, with either CTC or Seq2Seq loss functions. We perform experiments on the standard LibriSpeech dataset, and leverage additional unlabeled data from LibriVox through pseudo-labeling. We show that while Transformer-based acoustic models have superior performance…

    Submitted 14 July, 2020; v1 submitted 19 November, 2019; originally announced November 2019.

    Comments: Published at the workshop on Self-supervision in Audio and Speech (SAS) at the 37th International Conference on Machine Learning (ICML 2020), Vienna, Austria

  26. arXiv:1906.04323  [pdf, other]

    cs.CL cs.SD eess.AS

    Word-level Speech Recognition with a Letter to Word Encoder

    Authors: Ronan Collobert, Awni Hannun, Gabriel Synnaeve

    Abstract: We propose a direct-to-word sequence model which uses a word network to learn word embeddings from letters. The word network can be integrated seamlessly with arbitrary sequence models including Connectionist Temporal Classification and encoder-decoder models with attention. We show our direct-to-word model can achieve word error rate gains over sub-word level models for speech recognition. We als…

    Submitted 14 July, 2020; v1 submitted 10 June, 2019; originally announced June 2019.

    Comments: ICML 2020

  27. arXiv:1904.05862  [pdf, other]

    cs.CL

    wav2vec: Unsupervised Pre-training for Speech Recognition

    Authors: Steffen Schneider, Alexei Baevski, Ronan Collobert, Michael Auli

    Abstract: We explore unsupervised pre-training for speech recognition by learning representations of raw audio. wav2vec is trained on large amounts of unlabeled audio data and the resulting representations are then used to improve acoustic model training. We pre-train a simple multi-layer convolutional neural network optimized via a noise contrastive binary classification task. Our experiments on WSJ reduce…

    Submitted 11 September, 2019; v1 submitted 11 April, 2019; originally announced April 2019.

  28. Who Needs Words? Lexicon-Free Speech Recognition

    Authors: Tatiana Likhomanenko, Gabriel Synnaeve, Ronan Collobert

    Abstract: Lexicon-free speech recognition naturally deals with the problem of out-of-vocabulary (OOV) words. In this paper, we show that character-based language models (LM) can perform as well as word-based LMs for speech recognition, in word error rates (WER), even without restricting the decoding to a lexicon. We study character-based LMs and show that convolutional LMs can effectively leverage large (ch…

    Submitted 13 September, 2019; v1 submitted 9 April, 2019; originally announced April 2019.

    Comments: 8 pages, 1 figure

    Journal ref: Proc. Interspeech 2019

  29. arXiv:1904.02619  [pdf, other]

    cs.CL

    Sequence-to-Sequence Speech Recognition with Time-Depth Separable Convolutions

    Authors: Awni Hannun, Ann Lee, Qiantong Xu, Ronan Collobert

    Abstract: We propose a fully convolutional sequence-to-sequence encoder architecture with a simple and efficient decoder. Our model improves WER on LibriSpeech while being an order of magnitude more efficient than a strong RNN baseline. Key to our approach is a time-depth separable convolution block which dramatically reduces the number of parameters in the model while keeping the receptive field large. We…

    Submitted 4 April, 2019; originally announced April 2019.

  30. arXiv:1902.06022  [pdf, other]

    cs.CL

    A Fully Differentiable Beam Search Decoder

    Authors: Ronan Collobert, Awni Hannun, Gabriel Synnaeve

    Abstract: We introduce a new beam search decoder that is fully differentiable, making it possible to optimize at training time through the inference procedure. Our decoder allows us to combine models which operate at different granularities (e.g. acoustic and language models). It can be used when target sequences are not aligned to input sequences by considering all possible alignments between the two. We d…

    Submitted 15 February, 2019; originally announced February 2019.

  31. wav2letter++: The Fastest Open-source Speech Recognition System

    Authors: Vineel Pratap, Awni Hannun, Qiantong Xu, Jeff Cai, Jacob Kahn, Gabriel Synnaeve, Vitaliy Liptchinsky, Ronan Collobert

    Abstract: This paper introduces wav2letter++, the fastest open-source deep learning speech recognition framework. wav2letter++ is written entirely in C++, and uses the ArrayFire tensor library for maximum efficiency. Here we explain the architecture and design of the wav2letter++ system and compare it to other major open-source speech recognition systems. In some cases wav2letter++ is more than 2x faster th…

    Submitted 18 December, 2018; originally announced December 2018.

  32. arXiv:1812.06864  [pdf, other]

    cs.CL

    Fully Convolutional Speech Recognition

    Authors: Neil Zeghidour, Qiantong Xu, Vitaliy Liptchinsky, Nicolas Usunier, Gabriel Synnaeve, Ronan Collobert

    Abstract: Current state-of-the-art speech recognition systems build on recurrent neural networks for acoustic and/or language modeling, and rely on feature extraction pipelines to extract mel-filterbanks or cepstral coefficients. In this paper we present an alternative approach based solely on convolutional neural networks, leveraging recent advances in acoustic models from the raw waveform and language mod…

    Submitted 9 April, 2019; v1 submitted 17 December, 2018; originally announced December 2018.

  33. arXiv:1812.03483  [pdf, ps, other]

    cs.LG cs.CL cs.SD eess.AS stat.ML

    To Reverse the Gradient or Not: An Empirical Comparison of Adversarial and Multi-task Learning in Speech Recognition

    Authors: Yossi Adi, Neil Zeghidour, Ronan Collobert, Nicolas Usunier, Vitaliy Liptchinsky, Gabriel Synnaeve

    Abstract: Transcribed datasets typically contain speaker identity for each instance in the data. We investigate two ways to incorporate this information during training: Multi-Task Learning and Adversarial Learning. In multi-task learning, the goal is speaker prediction; we expect a performance improvement with this joint training if the two tasks of speech recognition and speaker recognition share a common…

    Submitted 14 February, 2019; v1 submitted 9 December, 2018; originally announced December 2018.

  34. arXiv:1806.07098  [pdf, other]

    cs.CL cs.SD eess.AS

    End-to-End Speech Recognition From the Raw Waveform

    Authors: Neil Zeghidour, Nicolas Usunier, Gabriel Synnaeve, Ronan Collobert, Emmanuel Dupoux

    Abstract: State-of-the-art speech recognition systems rely on fixed, hand-crafted features such as mel-filterbanks to preprocess the waveform before the training pipeline. In this paper, we study end-to-end systems trained directly from the raw waveform, building on two alternatives for trainable replacements of mel-filterbanks that use a convolutional architecture. The first one is inspired by gammatone fi…

    Submitted 21 June, 2018; v1 submitted 19 June, 2018; originally announced June 2018.

    Comments: Accepted for presentation at Interspeech 2018

  35. arXiv:1712.09444  [pdf, other]

    cs.CL cs.AI

    Letter-Based Speech Recognition with Gated ConvNets

    Authors: Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan Collobert

    Abstract: In the recent literature, "end-to-end" speech systems often refer to letter-based acoustic models trained in a sequence-to-sequence manner, either via a recurrent model or via a structured output learning approach (such as CTC). In contrast to traditional phone (or senone)-based approaches, these "end-to-end" approaches alleviate the need of word pronunciation modeling, and do not require a "forc…

    Submitted 15 February, 2019; v1 submitted 22 December, 2017; originally announced December 2017.

    Comments: 13 pages. arXiv admin note: text overlap with arXiv:1609.03193

  36. arXiv:1609.03193  [pdf, other]

    cs.LG cs.AI cs.CL

    Wav2Letter: an End-to-End ConvNet-based Speech Recognition System

    Authors: Ronan Collobert, Christian Puhrsch, Gabriel Synnaeve

    Abstract: This paper presents a simple end-to-end model for speech recognition, combining a convolutional network based acoustic model and a graph decoding. It is trained to output letters, with transcribed speech, without the need for force alignment of phonemes. We introduce an automatic segmentation criterion for training from sequence annotation without alignment that is on par with CTC while being simp…

    Submitted 12 September, 2016; v1 submitted 11 September, 2016; originally announced September 2016.

    Comments: 8 pages, 4 figures (7 plots/schemas), 2 tables (4 tabulars)

    ACM Class: I.2.6; I.2.7

  37. arXiv:1606.09560  [pdf, other]

    cs.CL

    Neural Network-based Word Alignment through Score Aggregation

    Authors: Joel Legrand, Michael Auli, Ronan Collobert

    Abstract: We present a simple neural network for word alignment that builds source and target word window representations to compute alignment scores for sentence pairs. To enable unsupervised training, we use an aggregation operation that summarizes the alignment scores for a given target word. A soft-margin objective increases scores for true target words while decreasing scores for target words that are…

    Submitted 30 June, 2016; originally announced June 2016.

  38. arXiv:1603.08695  [pdf, other]

    cs.CV

    Learning to Refine Object Segments

    Authors: Pedro O. Pinheiro, Tsung-Yi Lin, Ronan Collobert, Piotr Dollàr

    Abstract: Object segmentation requires both object-level information and low-level pixel data. This presents a challenge for feedforward networks: lower layers in convolutional nets capture rich spatial information, while upper layers encode object-level knowledge but are invariant to factors such as pose and appearance. In this work we propose to augment feedforward nets for object segmentation with a nove…

    Submitted 26 July, 2016; v1 submitted 29 March, 2016; originally announced March 2016.

    Comments: extended version of ECCV camera-ready (figures 6-9 only in arXiv)

  39. arXiv:1511.03776  [pdf, other]

    cs.CV

    ProNet: Learning to Propose Object-specific Boxes for Cascaded Neural Networks

    Authors: Chen Sun, Manohar Paluri, Ronan Collobert, Ram Nevatia, Lubomir Bourdev

    Abstract: This paper aims to classify and locate objects accurately and efficiently, without using bounding box annotations. It is challenging as objects in the wild could appear at arbitrary locations and in different scales. In this paper, we propose a novel classification architecture ProNet based on convolutional neural networks. It uses computationally efficient neural networks to propose image regions…

    Submitted 12 April, 2016; v1 submitted 12 November, 2015; originally announced November 2015.

    Comments: CVPR 2016 (fixed reference issue)

  40. arXiv:1506.06204  [pdf, other]

    cs.CV

    Learning to Segment Object Candidates

    Authors: Pedro O. Pinheiro, Ronan Collobert, Piotr Dollar

    Abstract: Recent object detection systems rely on two critical steps: (1) a set of object proposals is predicted as efficiently as possible, and (2) this set of candidate proposals is then passed to an object classifier. Such approaches have been shown to be fast, while achieving the state of the art in detection performance. In this paper, we propose a new way to generate object proposals, introducin…

    Submitted 1 September, 2015; v1 submitted 20 June, 2015; originally announced June 2015.

  41. arXiv:1506.05703  [pdf, other]

    cs.CL

    "The Sum of Its Parts": Joint Learning of Word and Phrase Representations with Autoencoders

    Authors: Rémi Lebret, Ronan Collobert

    Abstract: Recently, there has been a lot of effort to represent words in continuous vector spaces. Those representations have been shown to capture both semantic and syntactic information about words. However, distributed representations of phrases remain a challenge. We introduce a novel model that jointly learns word vector representations and their summation. Word representations are learnt using the wor…

    Submitted 18 June, 2015; originally announced June 2015.

    Comments: Deep Learning Workshop, ICML 2015

  42. arXiv:1502.03671  [pdf, other]

    cs.CL

    Phrase-based Image Captioning

    Authors: Rémi Lebret, Pedro O. Pinheiro, Ronan Collobert

    Abstract: Generating a novel textual description of an image is an interesting problem that connects computer vision and natural language processing. In this paper, we present a simple model that is able to generate descriptive sentences given a sample image. This model has a strong focus on the syntax of the descriptions. We train a purely bilinear model that learns a metric between an image representation…

    Submitted 9 April, 2015; v1 submitted 12 February, 2015; originally announced February 2015.

  43. arXiv:1412.8419  [pdf, other]

    cs.CL cs.CV cs.NE

    Simple Image Description Generator via a Linear Phrase-Based Approach

    Authors: Remi Lebret, Pedro O. Pinheiro, Ronan Collobert

    Abstract: Generating a novel textual description of an image is an interesting problem that connects computer vision and natural language processing. In this paper, we present a simple model that is able to generate descriptive sentences given a sample image. This model has a strong focus on the syntax of the descriptions. We train a purely bilinear model that learns a metric between an image representation…

    Submitted 10 April, 2015; v1 submitted 29 December, 2014; originally announced December 2014.

    Comments: Accepted as a workshop paper at ICLR 2015

  44. arXiv:1412.7110  [pdf, other]

    cs.LG cs.CL cs.NE

    Learning linearly separable features for speech recognition using convolutional neural networks

    Authors: Dimitri Palaz, Mathew Magimai Doss, Ronan Collobert

    Abstract: Automatic speech recognition systems usually rely on spectral-based features, such as MFCC or PLP. These features are extracted based on prior knowledge such as speech perception and/or speech production. Recently, convolutional neural networks have been shown to be able to estimate phoneme conditional probabilities in a completely data-driven manner, i.e. using directly temporal raw speech signa…

    Submitted 16 April, 2015; v1 submitted 22 December, 2014; originally announced December 2014.

    Comments: Final version for ICLR 2015 Workshop; Revisions according to reviews. Revised Section 4.5. Add references and correct typos. Submitted for ICLR 2015 conference track

  45. arXiv:1412.7028  [pdf, other]

    cs.LG cs.CL cs.NE

    Joint RNN-Based Greedy Parsing and Word Composition

    Authors: Joël Legrand, Ronan Collobert

    Abstract: This paper introduces a greedy parser based on neural networks, which leverages a new compositional sub-tree representation. The greedy parser and the compositional procedure are jointly trained, and tightly depend on each other. The composition procedure outputs a vector representation which summarizes syntactically (parsing tags) and semantically (words) sub-trees. Composition and tagging is ac…

    Submitted 10 April, 2015; v1 submitted 22 December, 2014; originally announced December 2014.

    Comments: Published as a conference paper at ICLR 2015

  46. arXiv:1412.6604  [pdf, ps, other]

    cs.LG cs.CV

    Video (language) modeling: a baseline for generative models of natural videos

    Authors: MarcAurelio Ranzato, Arthur Szlam, Joan Bruna, Michael Mathieu, Ronan Collobert, Sumit Chopra

    Abstract: We propose a strong baseline model for unsupervised feature learning using video data. By learning to predict missing frames or extrapolate future frames from an input video sequence, the model discovers both spatial and temporal correlations which are useful to represent complex deformations and motion patterns. The models we propose are largely borrowed from the language modeling literature, and…

    Submitted 4 May, 2016; v1 submitted 20 December, 2014; originally announced December 2014.

  47. arXiv:1412.6277  [pdf, ps, other]

    cs.CL

    N-gram-Based Low-Dimensional Representation for Document Classification

    Authors: Rémi Lebret, Ronan Collobert

    Abstract: The bag-of-words (BOW) model is the common approach for classifying documents, where words are used as features for training a classifier. This generally involves a huge number of features. Some techniques, such as Latent Semantic Analysis (LSA) or Latent Dirichlet Allocation (LDA), have been designed to summarize documents in a lower dimension with the least semantic information loss. Some semanti…

    Submitted 10 April, 2015; v1 submitted 19 December, 2014; originally announced December 2014.

    Comments: Accepted as a workshop contribution at ICLR 2015

  48. arXiv:1412.4930  [pdf, other]

    cs.CL

    Rehabilitation of Count-based Models for Word Vector Representations

    Authors: Rémi Lebret, Ronan Collobert

    Abstract: Recent works on word representations mostly rely on predictive models. Distributed word representations (aka word embeddings) are trained to optimally predict the contexts in which the corresponding words tend to appear. Such models have succeeded in capturing word similarities as well as semantic and syntactic regularities. Instead, we aim at reviving interest in a model based on counts. We presen…

    Submitted 8 April, 2015; v1 submitted 16 December, 2014; originally announced December 2014.

    Comments: A. Gelbukh (Ed.), Springer International Publishing Switzerland

    Journal ref: CICLing 2015, Part I, LNCS 9041, pp. 417-429, 2015

  49. arXiv:1411.6228  [pdf, other]

    cs.CV

    From Image-level to Pixel-level Labeling with Convolutional Networks

    Authors: Pedro O. Pinheiro, Ronan Collobert

    Abstract: We are interested in inferring object segmentation by leveraging only object class information, and by considering only minimal priors on the object segmentation task. This problem could be viewed as a kind of weakly supervised segmentation task, and naturally fits the Multiple Instance Learning (MIL) framework: every training image is known to have (or not) at least one pixel corresponding to the…

    Submitted 24 April, 2015; v1 submitted 23 November, 2014; originally announced November 2014.

    Comments: CVPR2015

  50. arXiv:1312.5542  [pdf, ps, other]

    cs.CL cs.LG

    Word Emdeddings through Hellinger PCA

    Authors: Rémi Lebret, Ronan Collobert

    Abstract: Word embeddings resulting from neural language models have been shown to be successful for a large variety of NLP tasks. However, such architectures might be difficult and time-consuming to train. Instead, we propose to drastically simplify the word embeddings computation through a Hellinger PCA of the word co-occurrence matrix. We compare those new word embeddings with some well-known embeddings on…

    Submitted 4 January, 2017; v1 submitted 19 December, 2013; originally announced December 2013.

    Comments: 9 pages, 5 tables

    Journal ref: Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2014