Skip to main content

Showing 1–21 of 21 results for author: Roziere, B

Searching in archive cs. Search in all archives.
.
  1. arXiv:2404.19737  [pdf, other

    cs.CL

    Better & Faster Large Language Models via Multi-token Prediction

    Authors: Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve

    Abstract: Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared m… ▽ More

    Submitted 30 April, 2024; originally announced April 2024.

  2. arXiv:2403.07816  [pdf, other

    cs.CL cs.AI

    Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM

    Authors: Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Rozière, Jacob Kahn, Daniel Li, Wen-tau Yih, Jason Weston, Xian Li

    Abstract: We investigate efficient methods for training Large Language Models (LLMs) to possess capabilities in multiple specialized domains, such as coding, math reasoning and world knowledge. Our method, named Branch-Train-MiX (BTX), starts from a seed model, which is branched to train experts in embarrassingly parallel fashion with high throughput and reduced communication cost. After individual experts… ▽ More

    Submitted 12 March, 2024; originally announced March 2024.

  3. arXiv:2402.01035  [pdf, other

    cs.CL

    Getting the most out of your tokenizer for pre-training and domain adaptation

    Authors: Gautier Dagan, Gabriel Synnaeve, Baptiste Rozière

    Abstract: Tokenization is an understudied and often neglected component of modern LLMs. Most published works use a single tokenizer for all experiments, often borrowed from another model, without performing ablations or analysis to optimize tokenization. Moreover, the tokenizer is generally kept unchanged when fine-tuning a base model. In this paper, we show that the size, pre-tokenization regular expressio… ▽ More

    Submitted 7 February, 2024; v1 submitted 1 February, 2024; originally announced February 2024.

  4. arXiv:2401.03065  [pdf, other

    cs.SE cs.AI cs.LG

    CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution

    Authors: Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, Sida I. Wang

    Abstract: We present CRUXEval (Code Reasoning, Understanding, and eXecution Evaluation), a benchmark consisting of 800 Python functions (3-13 lines). Each function comes with an input-output pair, leading to two natural tasks: input prediction and output prediction. First, we propose a generic recipe for generating our execution benchmark which can be used to create future variation of the benchmark. Second… ▽ More

    Submitted 5 January, 2024; originally announced January 2024.

    Comments: 71 pages, 29 figures

  5. arXiv:2312.02418  [pdf, other

    cs.CL cs.AI cs.LG

    Decoding Data Quality via Synthetic Corruptions: Embedding-guided Pruning of Code Data

    Authors: Yu Yang, Aaditya K. Singh, Mostafa Elhoushi, Anas Mahmoud, Kushal Tirumala, Fabian Gloeckle, Baptiste Rozière, Carole-Jean Wu, Ari S. Morcos, Newsha Ardalani

    Abstract: Code datasets, often collected from diverse and uncontrolled sources such as GitHub, potentially suffer from quality issues, thereby affecting the performance and training efficiency of Large Language Models (LLMs) optimized for code generation. Previous studies demonstrated the benefit of using embedding spaces for data pruning, but they mainly focused on duplicate removal or increasing variety,… ▽ More

    Submitted 4 December, 2023; originally announced December 2023.

    Comments: 12 pages, 4 figures, Oral Presentation at 3rd Workshop on Efficient Natural Language and Speech Processing (ENLSP-III), NeurIPS 2023

  6. arXiv:2309.07062  [pdf, other

    cs.PL cs.AI cs.CL cs.LG

    Large Language Models for Compiler Optimization

    Authors: Chris Cummins, Volker Seeker, Dejan Grubisic, Mostafa Elhoushi, Youwei Liang, Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Kim Hazelwood, Gabriel Synnaeve, Hugh Leather

    Abstract: We explore the novel application of Large Language Models to code optimization. We present a 7B-parameter transformer model trained from scratch to optimize LLVM assembly for code size. The model takes as input unoptimized assembly and outputs a list of compiler options to best optimize the program. Crucially, during training, we ask the model to predict the instruction counts before and after opt… ▽ More

    Submitted 11 September, 2023; originally announced September 2023.

  7. arXiv:2308.12950  [pdf, other

    cs.CL

    Code Llama: Open Foundation Models for Code

    Authors: Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, **gyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom , et al. (1 additional authors not shown)

    Abstract: We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama… ▽ More

    Submitted 31 January, 2024; v1 submitted 24 August, 2023; originally announced August 2023.

  8. arXiv:2304.10379  [pdf, other

    cs.SE

    Leveraging Static Analysis for Bug Repair

    Authors: Ruba Mutasim, Gabriel Synnaeve, David Pichardie, Baptiste Rozière

    Abstract: We propose a method combining machine learning with a static analysis tool (i.e. Infer) to automatically repair source code. Machine Learning methods perform well for producing idiomatic source code. However, their output is sometimes difficult to trust as language models can output incorrect code with high confidence. Static analysis tools are trustable, but also less flexible and produce non-idi… ▽ More

    Submitted 21 April, 2023; v1 submitted 20 April, 2023; originally announced April 2023.

    Comments: 13 pages. DL4C 2023

  9. arXiv:2302.13971  [pdf, other

    cs.CL

    LLaMA: Open and Efficient Foundation Language Models

    Authors: Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample

    Abstract: We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is co… ▽ More

    Submitted 27 February, 2023; originally announced February 2023.

  10. arXiv:2302.07842  [pdf, ps, other

    cs.CL

    Augmented Language Models: a Survey

    Authors: Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, Edouard Grave, Yann LeCun, Thomas Scialom

    Abstract: This survey reviews works in which language models (LMs) are augmented with reasoning skills and the ability to use tools. The former is defined as decomposing a potentially complex task into simpler subtasks while the latter consists in calling external modules such as a code interpreter. LMs can leverage these augmentations separately or in combination via heuristics, or learn to do so from demo… ▽ More

    Submitted 15 February, 2023; originally announced February 2023.

  11. arXiv:2207.03578  [pdf, other

    cs.PL cs.CL cs.LG

    Code Translation with Compiler Representations

    Authors: Marc Szafraniec, Baptiste Roziere, Hugh Leather, Francois Charton, Patrick Labatut, Gabriel Synnaeve

    Abstract: In this paper, we leverage low-level compiler intermediate representations (IR) to improve code translation. Traditional transpilers rely on syntactic information and handcrafted rules, which limits their applicability and produces unnatural-looking code. Applying neural machine translation (NMT) approaches to code has successfully broadened the set of programs on which one can get a natural-looki… ▽ More

    Submitted 24 April, 2023; v1 submitted 30 June, 2022; originally announced July 2022.

    Comments: 9 pages

  12. arXiv:2110.06773  [pdf, other

    cs.SE cs.CL cs.LG

    Leveraging Automated Unit Tests for Unsupervised Code Translation

    Authors: Baptiste Roziere, Jie M. Zhang, Francois Charton, Mark Harman, Gabriel Synnaeve, Guillaume Lample

    Abstract: With little to no parallel data available for programming languages, unsupervised methods are well-suited to source code translation. However, the majority of unsupervised machine translation approaches rely on back-translation, a method developed in the context of natural language translation and one that inherently involves training on noisy inputs. Unfortunately, source code is highly sensitive… ▽ More

    Submitted 16 February, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

  13. arXiv:2102.07492  [pdf, other

    cs.CL

    DOBF: A Deobfuscation Pre-Training Objective for Programming Languages

    Authors: Baptiste Roziere, Marie-Anne Lachaux, Marc Szafraniec, Guillaume Lample

    Abstract: Recent advances in self-supervised learning have dramatically improved the state of the art on a wide variety of tasks. However, research in language model pre-training has mostly focused on natural languages, and it is unclear whether models like BERT and its variants provide the best pre-training when applied to other modalities, such as source code. In this paper, we introduce a new pre-trainin… ▽ More

    Submitted 27 October, 2021; v1 submitted 15 February, 2021; originally announced February 2021.

  14. arXiv:2010.04542  [pdf, other

    cs.LG

    Black-Box Optimization Revisited: Improving Algorithm Selection Wizards through Massive Benchmarking

    Authors: Laurent Meunier, Herilalaina Rakotoarison, Pak Kan Wong, Baptiste Roziere, Jeremy Rapin, Olivier Teytaud, Antoine Moreau, Carola Doerr

    Abstract: Existing studies in black-box optimization for machine learning suffer from low generalizability, caused by a typically selective choice of problem instances used for training and testing different optimization algorithms. Among other issues, this practice promotes overfitting and poor-performing user guidelines. To address this shortcoming, we propose in this work a benchmark suite, OptimSuite, w… ▽ More

    Submitted 23 February, 2021; v1 submitted 8 October, 2020; originally announced October 2020.

  15. arXiv:2009.13311  [pdf, other

    cs.CV cs.LG

    EvolGAN: Evolutionary Generative Adversarial Networks

    Authors: Baptiste Roziere, Fabien Teytaud, Vlad Hosu, Hanhe Lin, Jeremy Rapin, Mariia Zameshina, Olivier Teytaud

    Abstract: We propose to use a quality estimator and evolutionary methods to search the latent space of generative adversarial networks trained on small, difficult datasets, or both. The new method leads to the generation of significantly higher quality images while preserving the original generator's diversity. Human raters preferred an image from the new version with frequency 83.7pc for Cats, 74pc for Fas… ▽ More

    Submitted 28 September, 2020; originally announced September 2020.

    Comments: accepted ACCV oral

  16. arXiv:2009.12177  [pdf, other

    cs.CV cs.LG eess.IV

    Tarsier: Evolving Noise Injection in Super-Resolution GANs

    Authors: Baptiste Roziere, Nathanal Carraz Rakotonirina, Vlad Hosu, Andry Rasoanaivo, Hanhe Lin, Camille Couprie, Olivier Teytaud

    Abstract: Super-resolution aims at increasing the resolution and level of detail within an image. The current state of the art in general single-image super-resolution is held by NESRGAN+, which injects a Gaussian noise after each residual layer at training time. In this paper, we harness evolutionary methods to improve NESRGAN+ by optimizing the noise injection at inference time. More precisely, we use Dia… ▽ More

    Submitted 25 September, 2020; originally announced September 2020.

  17. arXiv:2006.03511  [pdf, other

    cs.CL cs.PL

    Unsupervised Translation of Programming Languages

    Authors: Marie-Anne Lachaux, Baptiste Roziere, Lowik Chanussot, Guillaume Lample

    Abstract: A transcompiler, also known as source-to-source translator, is a system that converts source code from a high-level programming language (such as C++ or Python) to another. Transcompilers are primarily used for interoperability, and to port codebases written in an obsolete or deprecated language (e.g. COBOL, Python 2) to a modern one. They typically rely on handcrafted rewrite rules, applied to th… ▽ More

    Submitted 22 September, 2020; v1 submitted 5 June, 2020; originally announced June 2020.

  18. arXiv:2004.14014  [pdf, other

    cs.AI cs.NE

    Versatile Black-Box Optimization

    Authors: Jialin Liu, Antoine Moreau, Mike Preuss, Baptiste Roziere, Jeremy Rapin, Fabien Teytaud, Olivier Teytaud

    Abstract: Choosing automatically the right algorithm using problem descriptors is a classical component of combinatorial optimization. It is also a good tool for making evolutionary algorithms fast, robust and versatile. We present Shiwa, an algorithm good at both discrete and continuous, noisy and noise-free, sequential and parallel, black-box optimization. Our algorithm is experimentally compared to compe… ▽ More

    Submitted 29 April, 2020; originally announced April 2020.

    Comments: Accepted at GECCO 2020

  19. arXiv:2002.03839  [pdf, other

    cs.LG stat.ML

    Adversarial Attacks on Linear Contextual Bandits

    Authors: Evrard Garcelon, Baptiste Roziere, Laurent Meunier, Jean Tarbouriech, Olivier Teytaud, Alessandro Lazaric, Matteo Pirotta

    Abstract: Contextual bandit algorithms are applied in a wide range of domains, from advertising to recommender systems, from clinical trials to education. In many of these domains, malicious agents may have incentives to attack the bandit algorithm to induce it to perform a desired behavior. For instance, an unscrupulous ad publisher may try to increase their own revenue at the expense of the advertisers; a… ▽ More

    Submitted 23 October, 2020; v1 submitted 10 February, 2020; originally announced February 2020.

  20. arXiv:2001.09832  [pdf, other

    cs.LG stat.ML

    Polygames: Improved Zero Learning

    Authors: Tristan Cazenave, Yen-Chi Chen, Guan-Wei Chen, Shi-Yu Chen, Xian-Dong Chiu, Julien Dehos, Maria Elsa, Qucheng Gong, Hengyuan Hu, Vasil Khalidov, Cheng-Ling Li, Hsin-I Lin, Yu-** Lin, Xavier Martinet, Vegard Mella, Jeremy Rapin, Baptiste Roziere, Gabriel Synnaeve, Fabien Teytaud, Olivier Teytaud, Shi-Cheng Ye, Yi-Jun Ye, Shi-Jim Yen, Sergey Zagoruyko

    Abstract: Since DeepMind's AlphaZero, Zero learning quickly became the state-of-the-art method for many board games. It can be improved using a fully convolutional structure (no fully connected layer). Using such an architecture plus global pooling, we can create bots independent of the board size. The training can be made more robust by kee** track of the best checkpoints during the training and by train… ▽ More

    Submitted 27 January, 2020; originally announced January 2020.

  21. arXiv:1906.11661  [pdf, other

    cs.CV cs.LG stat.ML

    Inspirational Adversarial Image Generation

    Authors: Baptiste Rozière, Morgane Riviere, Olivier Teytaud, Jérémy Rapin, Yann LeCun, Camille Couprie

    Abstract: The task of image generation started to receive some attention from artists and designers to inspire them in new creations. However, exploiting the results of deep generative models such as Generative Adversarial Networks can be long and tedious given the lack of existing tools. In this work, we propose a simple strategy to inspire creators with new generations learned from a dataset of their choi… ▽ More

    Submitted 2 April, 2021; v1 submitted 17 June, 2019; originally announced June 2019.

    Journal ref: TIP 2021