Skip to main content

Showing 1–50 of 82 results for author: Synnaeve, G

Searching in archive cs. Search in all archives.
.
  1. arXiv:2404.19737  [pdf, other

    cs.CL

    Better & Faster Large Language Models via Multi-token Prediction

    Authors: Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve

    Abstract: Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared m… ▽ More

    Submitted 30 April, 2024; originally announced April 2024.

  2. arXiv:2402.05755  [pdf, other

    cs.CL cs.SD eess.AS

    SpiRit-LM: Interleaved Spoken and Written Language Model

    Authors: Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta R. Costa-jussa, Maha Elbayad, Sravya Popuri, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Gabriel Synnaeve, Juan Pino, Benoit Sagot, Emmanuel Dupoux

    Abstract: We introduce SPIRIT-LM, a foundation multimodal language model that freely mixes text and speech. Our model is based on a pretrained text language model that we extend to the speech modality by continuously training it on text and speech units. Speech and text sequences are concatenated as a single set of tokens, and trained with a word-level interleaving method using a small automatically-curated… ▽ More

    Submitted 8 February, 2024; originally announced February 2024.

  3. arXiv:2402.01035  [pdf, other

    cs.CL

    Getting the most out of your tokenizer for pre-training and domain adaptation

    Authors: Gautier Dagan, Gabriel Synnaeve, Baptiste Rozière

    Abstract: Tokenization is an understudied and often neglected component of modern LLMs. Most published works use a single tokenizer for all experiments, often borrowed from another model, without performing ablations or analysis to optimize tokenization. Moreover, the tokenizer is generally kept unchanged when fine-tuning a base model. In this paper, we show that the size, pre-tokenization regular expressio… ▽ More

    Submitted 7 February, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

  4. arXiv:2401.04577  [pdf, other

    cs.SD cs.AI cs.LG eess.AS

    Masked Audio Generation using a Single Non-Autoregressive Transformer

    Authors: Alon Ziv, Itai Gat, Gael Le Lan, Tal Remez, Felix Kreuk, Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi

    Abstract: We introduce MAGNeT, a masked generative sequence modeling method that operates directly over several streams of audio tokens. Unlike prior work, MAGNeT is comprised of a single-stage, non-autoregressive transformer. During training, we predict spans of masked tokens obtained from a masking scheduler, while during inference we gradually construct the output sequence using several decoding steps. T… ▽ More

    Submitted 5 March, 2024; v1 submitted 9 January, 2024; originally announced January 2024.

  5. arXiv:2401.03065  [pdf, other

    cs.SE cs.AI cs.LG

    CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

    Authors: Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, Sida I. Wang

    Abstract: We present CRUXEval (Code Reasoning, Understanding, and eXecution Evaluation), a benchmark consisting of 800 Python functions (3-13 lines). Each function comes with an input-output pair, leading to two natural tasks: input prediction and output prediction. First, we propose a generic recipe for generating our execution benchmark which can be used to create future variation of the benchmark. Second… ▽ More

    Submitted 5 January, 2024; originally announced January 2024.

    Comments: 71 pages, 29 figures

  6. arXiv:2312.04724  [pdf, other

    cs.CR cs.LG

    Purple Llama CyberSecEval: A Secure Coding Benchmark for Language Models

    Authors: Manish Bhatt, Sahana Chennabasappa, Cyrus Nikolaidis, Shengye Wan, Ivan Evtimov, Dominik Gabi, Daniel Song, Faizan Ahmad, Cornelius Aschermann, Lorenzo Fontana, Sasha Frolov, Ravi Prakash Giri, Dhaval Kapil, Yiannis Kozyrakis, David LeBlanc, James Milazzo, Aleksandar Straumann, Gabriel Synnaeve, Varun Vontimitta, Spencer Whitman, Joshua Saxe

    Abstract: This paper presents CyberSecEval, a comprehensive benchmark developed to help bolster the cybersecurity of Large Language Models (LLMs) employed as coding assistants. As what we believe to be the most extensive unified cybersecurity safety benchmark to date, CyberSecEval provides a thorough evaluation of LLMs in two crucial security domains: their propensity to generate insecure code and their lev… ▽ More

    Submitted 7 December, 2023; originally announced December 2023.

  7. arXiv:2310.05224  [pdf, other

    cs.CL cs.LG

    Generative Spoken Language Model based on continuous word-sized audio tokens

    Authors: Robin Algayres, Yossi Adi, Tu Anh Nguyen, Jade Copet, Gabriel Synnaeve, Benoit Sagot, Emmanuel Dupoux

    Abstract: In NLP, text language models based on words or subwords are known to outperform their character-based counterparts. Yet, in the speech community, the standard input of spoken LMs are 20ms or 40ms-long discrete units (shorter than a phoneme). Taking inspiration from word-based LM, we introduce a Generative Spoken Language Model (GSLM) based on word-size continuous-valued audio embeddings that can g… ▽ More

    Submitted 8 October, 2023; originally announced October 2023.

    Comments: Conference paper at EMNLP 2023

  8. arXiv:2309.07974  [pdf, other

    cs.LG cs.AI

    A Data Source for Reasoning Embodied Agents

    Authors: Jack Lanchantin, Sainbayar Sukhbaatar, Gabriel Synnaeve, Yuxuan Sun, Kavya Srinet, Arthur Szlam

    Abstract: Recent progress in using machine learning models for reasoning tasks has been driven by novel model architectures, large-scale pre-training protocols, and dedicated reasoning datasets for fine-tuning. In this work, to further pursue these advances, we introduce a new data generator for machine reasoning that integrates with an embodied agent. The generated data consists of templated text queries a… ▽ More

    Submitted 14 September, 2023; originally announced September 2023.

  9. arXiv:2309.07062  [pdf, other

    cs.PL cs.AI cs.CL cs.LG

    Large Language Models for Compiler Optimization

    Authors: Chris Cummins, Volker Seeker, Dejan Grubisic, Mostafa Elhoushi, Youwei Liang, Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Kim Hazelwood, Gabriel Synnaeve, Hugh Leather

    Abstract: We explore the novel application of Large Language Models to code optimization. We present a 7B-parameter transformer model trained from scratch to optimize LLVM assembly for code size. The model takes as input unoptimized assembly and outputs a list of compiler options to best optimize the program. Crucially, during training, we ask the model to predict the instruction counts before and after opt… ▽ More

    Submitted 11 September, 2023; originally announced September 2023.

  10. arXiv:2308.12950  [pdf, other

    cs.CL

    Code Llama: Open Foundation Models for Code

    Authors: Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, **gyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom , et al. (1 additional authors not shown)

    Abstract: We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama… ▽ More

    Submitted 31 January, 2024; v1 submitted 24 August, 2023; originally announced August 2023.

  11. arXiv:2308.05725  [pdf, ps, other

    cs.CL cs.LG cs.SD eess.AS

    EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis

    Authors: Tu Anh Nguyen, Wei-Ning Hsu, Antony D'Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarani, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid, Felix Kreuk, Yossi Adi, Emmanuel Dupoux

    Abstract: Recent work has shown that it is possible to resynthesize high-quality speech based, not on text, but on low bitrate discrete units that have been learned in a self-supervised fashion and can therefore capture expressive aspects of speech that are hard to transcribe (prosody, voice styles, non-verbal vocalization). The adoption of these methods is still limited by the fact that most speech synthes… ▽ More

    Submitted 10 August, 2023; originally announced August 2023.

  12. arXiv:2308.02560  [pdf, other

    cs.SD cs.LG eess.AS

    From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion

    Authors: Robin San Roman, Yossi Adi, Antoine Deleforge, Romain Serizel, Gabriel Synnaeve, Alexandre Défossez

    Abstract: Deep generative models can generate high-fidelity audio conditioned on various types of representations (e.g., mel-spectrograms, Mel-frequency Cepstral Coefficients (MFCC)). Recently, such models have been used to synthesize audio waveforms conditioned on highly compressed representations. Although such methods produce impressive results, they are prone to generate audible artifacts when the condi… ▽ More

    Submitted 8 November, 2023; v1 submitted 2 August, 2023; originally announced August 2023.

    Comments: 10 pages

    Journal ref: Thirty-seventh Conference on Neural Information Processing Systems (2023)

  13. arXiv:2306.05284  [pdf, other

    cs.SD cs.AI cs.LG eess.AS

    Simple and Controllable Music Generation

    Authors: Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, Alexandre Défossez

    Abstract: We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models, e.g., hierarchicall… ▽ More

    Submitted 29 January, 2024; v1 submitted 8 June, 2023; originally announced June 2023.

    Comments: Published at Neurips 2023

  14. arXiv:2305.13009  [pdf, other

    cs.CL cs.LG cs.SD eess.AS

    Textually Pretrained Speech Language Models

    Authors: Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Defossez, Gabriel Synnaeve, Emmanuel Dupoux, Roy Schwartz, Yossi Adi

    Abstract: Speech language models (SpeechLMs) process and generate acoustic data only, without textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using a warm-start from a pretrained textual language models. We show using both automatic and human evaluations that TWIST outperforms a cold-start SpeechLM across the board. We empirically analyze the effect of different model de… ▽ More

    Submitted 30 January, 2024; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: NeurIPS 2023

  15. arXiv:2304.10379  [pdf, other

    cs.SE

    Leveraging Static Analysis for Bug Repair

    Authors: Ruba Mutasim, Gabriel Synnaeve, David Pichardie, Baptiste Rozière

    Abstract: We propose a method combining machine learning with a static analysis tool (i.e. Infer) to automatically repair source code. Machine Learning methods perform well for producing idiomatic source code. However, their output is sometimes difficult to trust as language models can output incorrect code with high confidence. Static analysis tools are trustable, but also less flexible and produce non-idi… ▽ More

    Submitted 21 April, 2023; v1 submitted 20 April, 2023; originally announced April 2023.

    Comments: 13 pages. DL4C 2023

  16. arXiv:2304.07193  [pdf, other

    cs.CV

    DINOv2: Learning Robust Visual Features without Supervision

    Authors: Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin , et al. (1 additional authors not shown)

    Abstract: The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pr… ▽ More

    Submitted 2 February, 2024; v1 submitted 14 April, 2023; originally announced April 2023.

  17. arXiv:2212.11353  [pdf, other

    cs.CL cs.LG

    Contrastive Distillation Is a Sample-Efficient Self-Supervised Loss Policy for Transfer Learning

    Authors: Chris Lengerich, Gabriel Synnaeve, Amy Zhang, Hugh Leather, Kurt Shuster, François Charton, Charysse Redwood

    Abstract: Traditional approaches to RL have focused on learning decision policies directly from episodic decisions, while slowly and implicitly learning the semantics of compositional representations needed for generalization. While some approaches have been adopted to refine representations via auxiliary self-supervised losses while simultaneously learning decision policies, learning compositional represen… ▽ More

    Submitted 21 December, 2022; originally announced December 2022.

  18. arXiv:2211.01223  [pdf, other

    cs.SD eess.AS

    Audio Language Modeling using Perceptually-Guided Discrete Representations

    Authors: Felix Kreuk, Yaniv Taigman, Adam Polyak, Jade Copet, Gabriel Synnaeve, Alexandre Défossez, Yossi Adi

    Abstract: In this work, we study the task of Audio Language Modeling, in which we aim at learning probabilistic models for audio that can be used for generation and completion. We use a state-of-the-art perceptually-guided audio compression model, to encode audio to discrete representations. Next, we train a transformer-based causal language model using these representations. At inference time, we perform a… ▽ More

    Submitted 4 November, 2022; v1 submitted 2 November, 2022; originally announced November 2022.

  19. arXiv:2210.14685  [pdf, other

    cs.LG cs.AI cs.RO

    Leveraging Demonstrations with Latent Space Priors

    Authors: Jonas Gehring, Deepak Gopinath, Jungdam Won, Andreas Krause, Gabriel Synnaeve, Nicolas Usunier

    Abstract: Demonstrations provide insight into relevant state or action space regions, bearing great potential to boost the efficiency and practicality of reinforcement learning agents. In this work, we propose to leverage demonstration datasets by combining skill learning and sequence modeling. Starting with a learned joint latent space, we separately train a generative model of demonstration sequences and… ▽ More

    Submitted 13 March, 2023; v1 submitted 26 October, 2022; originally announced October 2022.

    Comments: Published in Transactions on Machine Learning Research (03/2023)

  20. arXiv:2210.13438  [pdf, other

    eess.AS cs.AI cs.SD stat.ML

    High Fidelity Neural Audio Compression

    Authors: Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi

    Abstract: We introduce a state-of-the-art real-time, high-fidelity, audio codec leveraging neural networks. It consists in a streaming encoder-decoder architecture with quantized latent space trained in an end-to-end fashion. We simplify and speed-up the training by using a single multiscale spectrogram adversary that efficiently reduces artifacts and produce high-quality samples. We introduce a novel loss… ▽ More

    Submitted 24 October, 2022; originally announced October 2022.

    Comments: Preprint

  21. arXiv:2209.15483  [pdf, other

    cs.CL cs.LG eess.AS

    Augmentation Invariant Discrete Representation for Generative Spoken Language Modeling

    Authors: Itai Gat, Felix Kreuk, Tu Anh Nguyen, Ann Lee, Jade Copet, Gabriel Synnaeve, Emmanuel Dupoux, Yossi Adi

    Abstract: Generative Spoken Language Modeling research focuses on optimizing speech Language Models (LMs) using raw audio recordings without accessing any textual supervision. Such speech LMs usually operate over discrete units obtained from quantizing internal representations of self-supervised models. Although such units show impressive modeling results, their robustness capabilities have not been extensi… ▽ More

    Submitted 29 May, 2023; v1 submitted 30 September, 2022; originally announced September 2022.

  22. arXiv:2209.15352  [pdf, other

    cs.SD cs.CL cs.LG eess.AS

    AudioGen: Textually Guided Audio Generation

    Authors: Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, Yossi Adi

    Abstract: We tackle the problem of generating audio samples conditioned on descriptive text captions. In this work, we propose AaudioGen, an auto-regressive generative model that generates audio samples conditioned on text inputs. AudioGen operates on a learnt discrete audio representation. The task of text-to-audio generation poses multiple challenges. Due to the way audio travels through a medium, differe… ▽ More

    Submitted 5 March, 2023; v1 submitted 30 September, 2022; originally announced September 2022.

    Comments: Accepted to ICLR 2023

  23. arXiv:2207.03578  [pdf, other

    cs.PL cs.CL cs.LG

    Code Translation with Compiler Representations

    Authors: Marc Szafraniec, Baptiste Roziere, Hugh Leather, Francois Charton, Patrick Labatut, Gabriel Synnaeve

    Abstract: In this paper, we leverage low-level compiler intermediate representations (IR) to improve code translation. Traditional transpilers rely on syntactic information and handcrafted rules, which limits their applicability and produces unnatural-looking code. Applying neural machine translation (NMT) approaches to code has successfully broadened the set of programs on which one can get a natural-looki… ▽ More

    Submitted 24 April, 2023; v1 submitted 30 June, 2022; originally announced July 2022.

    Comments: 9 pages

  24. arXiv:2201.12465  [pdf, other

    cs.LG cs.AI cs.DC

    Flashlight: Enabling Innovation in Tools for Machine Learning

    Authors: Jacob Kahn, Vineel Pratap, Tatiana Likhomanenko, Qiantong Xu, Awni Hannun, Jeff Cai, Paden Tomasello, Ann Lee, Edouard Grave, Gilad Avidov, Benoit Steiner, Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan Collobert

    Abstract: As the computational requirements for machine learning systems and the size and complexity of machine learning frameworks increases, essential framework innovation has become challenging. While computational needs have driven recent compiler, networking, and hardware advancements, utilization of those advancements by machine learning tools is occurring at a slower pace. This is in part due to the… ▽ More

    Submitted 22 June, 2022; v1 submitted 28 January, 2022; originally announced January 2022.

    Comments: Presented at ICML 2022

  25. arXiv:2201.12208  [pdf, other

    cs.LG cs.SD eess.AS stat.ML

    Star Temporal Classification: Sequence Classification with Partially Labeled Data

    Authors: Vineel Pratap, Awni Hannun, Gabriel Synnaeve, Ronan Collobert

    Abstract: We develop an algorithm which can learn from partially labeled and unsegmented sequential data. Most sequential loss functions, such as Connectionist Temporal Classification (CTC), break down when many labels are missing. We address this problem with Star Temporal Classification (STC) which uses a special star token to allow alignments which include all possible tokens whenever a token could be mi… ▽ More

    Submitted 3 March, 2022; v1 submitted 28 January, 2022; originally announced January 2022.

  26. arXiv:2112.13692  [pdf, other

    cs.CV

    Augmenting Convolutional networks with attention-based aggregation

    Authors: Hugo Touvron, Matthieu Cord, Alaaeldin El-Nouby, Piotr Bojanowski, Armand Joulin, Gabriel Synnaeve, Hervé Jégou

    Abstract: We show how to augment any convolutional network with an attention-based global map to achieve non-local reasoning. We replace the final average pooling by an attention-based aggregation layer akin to a single transformer block, that weights how the patches are involved in the classification decision. We plug this learned aggregation layer with a simplistic patch-based convolutional network parame… ▽ More

    Submitted 27 December, 2021; originally announced December 2021.

  27. arXiv:2111.00161  [pdf, other

    cs.CL cs.SD eess.AS

    Pseudo-Labeling for Massively Multilingual Speech Recognition

    Authors: Loren Lugosch, Tatiana Likhomanenko, Gabriel Synnaeve, Ronan Collobert

    Abstract: Semi-supervised learning through pseudo-labeling has become a staple of state-of-the-art monolingual speech recognition systems. In this work, we extend pseudo-labeling to massively multilingual speech recognition with 60 languages. We propose a simple pseudo-labeling recipe that works well even with low-resource languages: train a supervised multilingual model, fine-tune it with semi-supervised l… ▽ More

    Submitted 8 March, 2022; v1 submitted 29 October, 2021; originally announced November 2021.

    Comments: Accepted to ICASSP 2022. New version has links to code/models + more training curves for larger model. (Fixed code link.)

  28. arXiv:2110.10809  [pdf, other

    cs.LG cs.AI cs.RO

    Hierarchical Skills for Efficient Exploration

    Authors: Jonas Gehring, Gabriel Synnaeve, Andreas Krause, Nicolas Usunier

    Abstract: In reinforcement learning, pre-trained low-level skills have the potential to greatly facilitate exploration. However, prior knowledge of the downstream task is required to strike the right balance between generality (fine-grained control) and specificity (faster learning) in skill design. In previous work on continuous control, the sensitivity of methods to this trade-off has not been addressed e… ▽ More

    Submitted 20 October, 2021; originally announced October 2021.

    Comments: To appear in 35th Conference on Neural Information Processing Systems (NeurIPS 2021)

  29. arXiv:2110.08583  [pdf, ps, other

    eess.AS cs.AI cs.CL cs.LG cs.SD

    ASR4REAL: An extended benchmark for speech models

    Authors: Morgane Riviere, Jade Copet, Gabriel Synnaeve

    Abstract: Popular ASR benchmarks such as Librispeech and Switchboard are limited in the diversity of settings and speakers they represent. We introduce a set of benchmarks matching real-life conditions, aimed at spotting possible biases and weaknesses in models. We have found out that even though recent models do not seem to exhibit a gender bias, they usually show important performance discrepancies by acc… ▽ More

    Submitted 16 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP 2022

  30. arXiv:2110.06773  [pdf, other

    cs.SE cs.CL cs.LG

    Leveraging Automated Unit Tests for Unsupervised Code Translation

    Authors: Baptiste Roziere, Jie M. Zhang, Francois Charton, Mark Harman, Gabriel Synnaeve, Guillaume Lample

    Abstract: With little to no parallel data available for programming languages, unsupervised methods are well-suited to source code translation. However, the majority of unsupervised machine translation approaches rely on back-translation, a method developed in the context of natural language translation and one that inherently involves training on noisy inputs. Unfortunately, source code is highly sensitive… ▽ More

    Submitted 16 February, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

  31. arXiv:2110.05994  [pdf, other

    eess.AS cs.CL cs.SD

    Word Order Does Not Matter For Speech Recognition

    Authors: Vineel Pratap, Qiantong Xu, Tatiana Likhomanenko, Gabriel Synnaeve, Ronan Collobert

    Abstract: In this paper, we study training of automatic speech recognition system in a weakly supervised setting where the order of words in transcript labels of the audio training data is not known. We train a word-level acoustic model which aggregates the distribution of all output frames using LogSumExp operation and uses a cross-entropy loss to match with the ground-truth words distribution. Using the p… ▽ More

    Submitted 18 October, 2021; v1 submitted 12 October, 2021; originally announced October 2021.

  32. arXiv:2106.09681  [pdf, other

    cs.CV cs.LG

    XCiT: Cross-Covariance Image Transformers

    Authors: Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, Hervé Jegou

    Abstract: Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens ,i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic comple… ▽ More

    Submitted 18 June, 2021; v1 submitted 17 June, 2021; originally announced June 2021.

  33. arXiv:2106.03143  [pdf, other

    cs.LG cs.CL cs.CV

    CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings

    Authors: Tatiana Likhomanenko, Qiantong Xu, Gabriel Synnaeve, Ronan Collobert, Alex Rogozhnikov

    Abstract: Without positional information, attention-based Transformer neural networks are permutation-invariant. Absolute or relative positional embeddings are the most popular ways to feed Transformer models with positional information. Absolute positional embeddings are simple to implement, but suffer from generalization issues when evaluating on sequences longer than seen at training time. Relative posit… ▽ More

    Submitted 8 November, 2021; v1 submitted 6 June, 2021; originally announced June 2021.

  34. arXiv:2105.03404  [pdf, other

    cs.CV

    ResMLP: Feedforward networks for image classification with data-efficient training

    Authors: Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, Hervé Jégou

    Abstract: We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. When trained with a modern training strategy using hea… ▽ More

    Submitted 10 June, 2021; v1 submitted 7 May, 2021; originally announced May 2021.

  35. arXiv:2104.12763  [pdf, other

    cs.CV cs.CL cs.LG

    MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding

    Authors: Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, Nicolas Carion

    Abstract: Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest from the image. However, this crucial module is typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of objects and attributes. This makes it challenging for such systems to capture the long tail of visual concepts expressed in free form text. In this… ▽ More

    Submitted 11 October, 2021; v1 submitted 26 April, 2021; originally announced April 2021.

  36. arXiv:2104.09987  [pdf, other

    stat.ML cs.AI cs.LG

    Differentiable Model Compression via Pseudo Quantization Noise

    Authors: Alexandre Défossez, Yossi Adi, Gabriel Synnaeve

    Abstract: We propose DiffQ a differentiable method for model compression for quantizing model parameters without gradient approximations (e.g., Straight Through Estimator). We suggest adding independent pseudo quantization noise to model parameters during training to approximate the effect of a quantization operator. DiffQ is differentiable both with respect to the unquantized weights and the number of bits… ▽ More

    Submitted 17 October, 2022; v1 submitted 20 April, 2021; originally announced April 2021.

    Comments: final TMLR version

  37. arXiv:2104.09937  [pdf, other

    cs.LG stat.ML

    Gradient Matching for Domain Generalization

    Authors: Yuge Shi, Jeffrey Seely, Philip H. S. Torr, N. Siddharth, Awni Hannun, Nicolas Usunier, Gabriel Synnaeve

    Abstract: Machine learning systems typically assume that the distributions of training and test sets match closely. However, a critical requirement of such systems in the real world is their ability to generalize to unseen domains. Here, we propose an inter-domain gradient matching objective that targets domain generalization by maximizing the inner product between gradients from different domains. Since di… ▽ More

    Submitted 13 July, 2021; v1 submitted 20 April, 2021; originally announced April 2021.

  38. arXiv:2104.01027  [pdf, other

    cs.SD cs.CL cs.LG eess.AS

    Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training

    Authors: Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve, Michael Auli

    Abstract: Self-supervised learning of speech representations has been a very active research area but most work is focused on a single domain such as read audio books for which there exist large quantities of labeled and unlabeled data. In this paper, we explore more general setups where the domain of the unlabeled data for pre-training data differs from the domain of the labeled data for fine-tuning, which… ▽ More

    Submitted 8 September, 2021; v1 submitted 2 April, 2021; originally announced April 2021.

  39. arXiv:2103.17239  [pdf, other

    cs.CV

    Going deeper with Image Transformers

    Authors: Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, Hervé Jégou

    Abstract: Transformers have been recently adapted for large scale image classification, achieving high scores shaking up the long supremacy of convolutional neural networks. However the optimization of image transformers has been little studied so far. In this work, we build and optimize deeper transformer networks for image classification. In particular, we investigate the interplay of architecture and opt… ▽ More

    Submitted 7 April, 2021; v1 submitted 31 March, 2021; originally announced March 2021.

  40. MLS: A Large-Scale Multilingual Dataset for Speech Research

    Authors: Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, Ronan Collobert

    Abstract: This paper introduces Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages, including about 44.5K hours of English and a total of about 6K hours for other languages. Additionally, we provide Language Models (LM) and baseline Automatic Speech Recognition (ASR) models an… ▽ More

    Submitted 19 December, 2020; v1 submitted 6 December, 2020; originally announced December 2020.

    Journal ref: Interspeech 2020

  41. arXiv:2011.00093  [pdf, other

    cs.CL cs.LG cs.SD

    Joint Masked CPC and CTC Training for ASR

    Authors: Chaitanya Talnikar, Tatiana Likhomanenko, Ronan Collobert, Gabriel Synnaeve

    Abstract: Self-supervised learning (SSL) has shown promise in learning representations of audio that are useful for automatic speech recognition (ASR). But, training SSL models like wav2vec~2.0 requires a two-stage pipeline. In this paper we demonstrate a single-stage training of ASR models that can utilize both unlabeled and labeled data. During training, we alternately minimize two losses: an unsupervised… ▽ More

    Submitted 13 February, 2021; v1 submitted 30 October, 2020; originally announced November 2020.

    Comments: ICASSP 2021

  42. arXiv:2010.11745  [pdf, ps, other

    cs.LG cs.CL cs.SD eess.AS

    Rethinking Evaluation in ASR: Are Our Models Robust Enough?

    Authors: Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Paden Tomasello, Jacob Kahn, Gilad Avidov, Ronan Collobert, Gabriel Synnaeve

    Abstract: Is pushing numbers on a single benchmark valuable in automatic speech recognition? Research results in acoustic modeling are typically evaluated based on performance on a single dataset. While the research community has coalesced around various benchmarks, we set out to understand generalization performance in acoustic modeling across datasets - in particular, if models trained on a single dataset… ▽ More

    Submitted 2 May, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

    MSC Class: 68T07; 68T10 ACM Class: I.2.6; I.5.4

  43. arXiv:2010.11524  [pdf, other

    cs.CL cs.LG

    SlimIPL: Language-Model-Free Iterative Pseudo-Labeling

    Authors: Tatiana Likhomanenko, Qiantong Xu, Jacob Kahn, Gabriel Synnaeve, Ronan Collobert

    Abstract: Recent results in end-to-end automatic speech recognition have demonstrated the efficacy of pseudo-labeling for semi-supervised models trained both with Connectionist Temporal Classification (CTC) and Sequence-to-Sequence (seq2seq) losses. Iterative Pseudo-Labeling (IPL), which continuously trains a single model using pseudo-labels iteratively re-generated as the model learns, has been shown to fu… ▽ More

    Submitted 29 August, 2021; v1 submitted 22 October, 2020; originally announced October 2020.

  44. arXiv:2010.11430  [pdf, other

    cs.LG cs.SD eess.AS

    Self-training and Pre-training are Complementary for Speech Recognition

    Authors: Qiantong Xu, Alexei Baevski, Tatiana Likhomanenko, Paden Tomasello, Alexis Conneau, Ronan Collobert, Gabriel Synnaeve, Michael Auli

    Abstract: Self-training and unsupervised pre-training have emerged as effective approaches to improve speech recognition systems using unlabeled data. However, it is not clear whether they learn similar patterns or if they can be effectively combined. In this paper, we show that pseudo-labeling and pre-training with wav2vec 2.0 are complementary in a variety of labeled data setups. Using just 10 minutes of… ▽ More

    Submitted 22 October, 2020; originally announced October 2020.

  45. arXiv:2010.03899  [pdf, other

    cs.CL cs.SD eess.AS

    Population Based Training for Data Augmentation and Regularization in Speech Recognition

    Authors: Daniel Haziza, Jérémy Rapin, Gabriel Synnaeve

    Abstract: Varying data augmentation policies and regularization over the course of optimization has led to performance improvements over using fixed values. We show that population based training is a useful tool to continuously search those hyperparameters, within a fixed budget. This greatly simplifies the experimental burden and computational cost of finding such optimal schedules. We experiment in speec… ▽ More

    Submitted 8 October, 2020; originally announced October 2020.

    Comments: tech report from Dec. 2019

    MSC Class: 68T10; 68T07 ACM Class: I.5.4; I.2.7

  46. arXiv:2007.03001  [pdf, other

    eess.AS cs.CL cs.SD

    Massively Multilingual ASR: 50 Languages, 1 Model, 1 Billion Parameters

    Authors: Vineel Pratap, Anuroop Sriram, Paden Tomasello, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan Collobert

    Abstract: We study training a single acoustic model for multiple languages with the aim of improving automatic speech recognition (ASR) performance on low-resource languages, and over-all simplifying deployment of ASR systems that support diverse languages. We perform an extensive benchmark on 51 languages, with varying amount of training data by language(from 100 hours to 1100 hours). We compare three vari… ▽ More

    Submitted 7 July, 2020; v1 submitted 6 July, 2020; originally announced July 2020.

  47. arXiv:2007.00991  [pdf, other

    eess.AS cs.CL cs.SD

    Data Augmenting Contrastive Learning of Speech Representations in the Time Domain

    Authors: Eugene Kharitonov, Morgane Rivière, Gabriel Synnaeve, Lior Wolf, Pierre-Emmanuel Mazaré, Matthijs Douze, Emmanuel Dupoux

    Abstract: Contrastive Predictive Coding (CPC), based on predicting future segments of speech based on past segments is emerging as a powerful algorithm for representation learning of speech signal. However, it still under-performs other methods on unsupervised evaluation benchmarks. Here, we introduce WavAugment, a time-domain data augmentation library and find that applying augmentation in the past is gene… ▽ More

    Submitted 2 July, 2020; originally announced July 2020.

  48. arXiv:2006.12847  [pdf, ps, other

    eess.AS cs.LG cs.SD stat.ML

    Real Time Speech Enhancement in the Waveform Domain

    Authors: Alexandre Defossez, Gabriel Synnaeve, Yossi Adi

    Abstract: We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU. The proposed model is based on an encoder-decoder architecture with skip-connections. It is optimized on both time and frequency domains, using multiple loss functions. Empirical evidence shows that it is capable of removing various kinds of background noise including stationary and non… ▽ More

    Submitted 6 September, 2020; v1 submitted 23 June, 2020; originally announced June 2020.

    Comments: Interspeech 2020 Paper

  49. arXiv:2005.12872  [pdf, other

    cs.CV

    End-to-End Object Detection with Transformers

    Authors: Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko

    Abstract: We present a new method that views object detection as a direct set prediction problem. Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components like a non-maximum suppression procedure or anchor generation that explicitly encode our prior knowledge about the task. The main ingredients of the new framework, called DEtection TRansformer or DET… ▽ More

    Submitted 28 May, 2020; v1 submitted 26 May, 2020; originally announced May 2020.

  50. arXiv:2005.09267  [pdf, other

    cs.CL cs.SD eess.AS

    Iterative Pseudo-Labeling for Speech Recognition

    Authors: Qiantong Xu, Tatiana Likhomanenko, Jacob Kahn, Awni Hannun, Gabriel Synnaeve, Ronan Collobert

    Abstract: Pseudo-labeling has recently shown promise in end-to-end automatic speech recognition (ASR). We study Iterative Pseudo-Labeling (IPL), a semi-supervised algorithm which efficiently performs multiple iterations of pseudo-labeling on unlabeled data as the acoustic model evolves. In particular, IPL fine-tunes an existing model at each iteration using both labeled data and a subset of unlabeled data.… ▽ More

    Submitted 26 August, 2020; v1 submitted 19 May, 2020; originally announced May 2020.

    Comments: INTERSPEECH 2020