Showing 1–50 of 54 results for author: Grave, E

Searching in archive cs.
  1. arXiv:2406.04496  [pdf, other]

    cs.CL cs.AI cs.LG

    Time Sensitive Knowledge Editing through Efficient Finetuning

    Authors: Xiou Ge, Ali Mousavi, Edouard Grave, Armand Joulin, Kun Qian, Benjamin Han, Mostafa Arefiyan, Yunyao Li

    Abstract: Large Language Models (LLMs) have demonstrated impressive capability in different tasks and are bringing transformative changes to many domains. However, keeping the knowledge in LLMs up-to-date remains a challenge once pretraining is complete. It is thus essential to design effective methods to both update obsolete knowledge and induce new knowledge into LLMs. Existing locate-and-edit knowledge e…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Accepted to ACL 2024 main conference

  2. arXiv:2311.13581  [pdf, other]

    cs.CL

    PaSS: Parallel Speculative Sampling

    Authors: Giovanni Monea, Armand Joulin, Edouard Grave

    Abstract: Scaling the size of language models to tens of billions of parameters has led to impressive performance on a wide range of tasks. At generation, these models are used auto-regressively, requiring a forward pass for each generated token, and thus reading the full set of parameters from memory. This memory access forms the primary bottleneck for generation and it worsens as the model size increases.…

    Submitted 22 November, 2023; originally announced November 2023.

    Comments: Accepted at the 3rd workshop on Efficient Natural Language and Speech Processing (ENLSP, NeurIPS 2023)

  3. arXiv:2302.13971  [pdf, other]

    cs.CL

    LLaMA: Open and Efficient Foundation Language Models

    Authors: Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample

    Abstract: We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is co…

    Submitted 27 February, 2023; originally announced February 2023.

  4. arXiv:2302.07842  [pdf, ps, other]

    cs.CL

    Augmented Language Models: a Survey

    Authors: Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, Edouard Grave, Yann LeCun, Thomas Scialom

    Abstract: This survey reviews works in which language models (LMs) are augmented with reasoning skills and the ability to use tools. The former is defined as decomposing a potentially complex task into simpler subtasks while the latter consists in calling external modules such as a code interpreter. LMs can leverage these augmentations separately or in combination via heuristics, or learn to do so from demo…

    Submitted 15 February, 2023; originally announced February 2023.

  5. arXiv:2209.13331  [pdf, other]

    cs.CL cs.LG

    EditEval: An Instruction-Based Benchmark for Text Improvements

    Authors: Jane Dwivedi-Yu, Timo Schick, Zhengbao Jiang, Maria Lomeli, Patrick Lewis, Gautier Izacard, Edouard Grave, Sebastian Riedel, Fabio Petroni

    Abstract: Evaluation of text generation to date has primarily focused on content created sequentially, rather than improvements on a piece of text. Writing, however, is naturally an iterative and incremental process that requires expertise in different modular skills such as fixing outdated information or making the style more consistent. Even so, comprehensive evaluation of a model's capacity to perform th…

    Submitted 27 September, 2022; originally announced September 2022.

  6. arXiv:2208.11663  [pdf, other]

    cs.CL

    PEER: A Collaborative Language Model

    Authors: Timo Schick, Jane Dwivedi-Yu, Zhengbao Jiang, Fabio Petroni, Patrick Lewis, Gautier Izacard, Qingfei You, Christoforos Nalmpantis, Edouard Grave, Sebastian Riedel

    Abstract: Textual content is often the output of a collaborative writing process: We start with an initial draft, ask for suggestions, and repeatedly make changes. Agnostic of this process, today's language models are trained to generate only the final result. As a consequence, they lack several abilities crucial for collaborative writing: They are unable to update existing texts, difficult to control and i…

    Submitted 24 August, 2022; originally announced August 2022.

  7. arXiv:2208.03299  [pdf, other]

    cs.CL

    Atlas: Few-shot Learning with Retrieval Augmented Language Models

    Authors: Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, Edouard Grave

    Abstract: Large language models have shown impressive few-shot results on a wide range of tasks. However, when knowledge is key for such results, as is the case for tasks such as question answering and fact checking, massive parameter counts to store knowledge seem to be needed. Retrieval augmented models are known to excel at knowledge intensive tasks without the need for as many parameters, but it is uncl…

    Submitted 16 November, 2022; v1 submitted 5 August, 2022; originally announced August 2022.

  8. arXiv:2207.06220  [pdf, other]

    cs.IR cs.AI

    Improving Wikipedia Verifiability with AI

    Authors: Fabio Petroni, Samuel Broscheit, Aleksandra Piktus, Patrick Lewis, Gautier Izacard, Lucas Hosseini, Jane Dwivedi-Yu, Maria Lomeli, Timo Schick, Pierre-Emmanuel Mazaré, Armand Joulin, Edouard Grave, Sebastian Riedel

    Abstract: Verifiability is a core content policy of Wikipedia: claims that are likely to be challenged need to be backed by citations. There are millions of articles available online and thousands of new articles are released each month. For this reason, finding relevant sources is a difficult task: many claims do not have any references that support them. Furthermore, even existing citations might not supp…

    Submitted 8 July, 2022; originally announced July 2022.

  9. arXiv:2201.12465  [pdf, other]

    cs.LG cs.AI cs.DC

    Flashlight: Enabling Innovation in Tools for Machine Learning

    Authors: Jacob Kahn, Vineel Pratap, Tatiana Likhomanenko, Qiantong Xu, Awni Hannun, Jeff Cai, Paden Tomasello, Ann Lee, Edouard Grave, Gilad Avidov, Benoit Steiner, Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan Collobert

    Abstract: As the computational requirements for machine learning systems and the size and complexity of machine learning frameworks increases, essential framework innovation has become challenging. While computational needs have driven recent compiler, networking, and hardware advancements, utilization of those advancements by machine learning tools is occurring at a slower pace. This is in part due to the…

    Submitted 22 June, 2022; v1 submitted 28 January, 2022; originally announced January 2022.

    Comments: Presented at ICML 2022

  10. arXiv:2112.10740  [pdf, other]

    cs.CV

    Are Large-scale Datasets Necessary for Self-Supervised Pre-training?

    Authors: Alaaeldin El-Nouby, Gautier Izacard, Hugo Touvron, Ivan Laptev, Hervé Jegou, Edouard Grave

    Abstract: Pre-training models on large scale datasets, like ImageNet, is a standard practice in computer vision. This paradigm is especially effective for tasks with small training sets, for which high-capacity models tend to overfit. In this work, we consider a self-supervised pre-training scenario that only leverages the target task data. We consider datasets, like Stanford Cars, Sketch or COCO, which are…

    Submitted 20 December, 2021; originally announced December 2021.

  11. arXiv:2112.09924  [pdf, other]

    cs.CL cs.AI cs.IR cs.LG

    The Web Is Your Oyster - Knowledge-Intensive NLP against a Very Large Web Corpus

    Authors: Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Dmytro Okhonko, Samuel Broscheit, Gautier Izacard, Patrick Lewis, Barlas Oğuz, Edouard Grave, Wen-tau Yih, Sebastian Riedel

    Abstract: In order to address increasing demands of real-world applications, the research for knowledge-intensive NLP (KI-NLP) should advance by capturing the challenges of a truly open-domain environment: web-scale knowledge, lack of structure, inconsistent quality and noise. To this end, we propose a new setup for evaluating existing knowledge intensive tasks in which we generalize the background corpus t…

    Submitted 24 May, 2022; v1 submitted 18 December, 2021; originally announced December 2021.

  12. arXiv:2112.09118  [pdf, other]

    cs.IR cs.AI cs.CL

    Unsupervised Dense Information Retrieval with Contrastive Learning

    Authors: Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, Edouard Grave

    Abstract: Recently, information retrieval has seen the emergence of dense retrievers, using neural networks, as an alternative to classical sparse methods based on term-frequency. These models have obtained state-of-the-art results on datasets and tasks where large training sets are available. However, they do not transfer well to new applications with no training data, and are outperformed by unsupervised…

    Submitted 29 August, 2022; v1 submitted 16 December, 2021; originally announced December 2021.

  13. arXiv:2105.03404  [pdf, other]

    cs.CV

    ResMLP: Feedforward networks for image classification with data-efficient training

    Authors: Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, Hervé Jégou

    Abstract: We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. When trained with a modern training strategy using hea…

    Submitted 10 June, 2021; v1 submitted 7 May, 2021; originally announced May 2021.

  14. arXiv:2101.00133  [pdf, other]

    cs.CL cs.AI

    NeurIPS 2020 EfficientQA Competition: Systems, Analyses and Lessons Learned

    Authors: Sewon Min, Jordan Boyd-Graber, Chris Alberti, Danqi Chen, Eunsol Choi, Michael Collins, Kelvin Guu, Hannaneh Hajishirzi, Kenton Lee, Jennimaria Palomaki, Colin Raffel, Adam Roberts, Tom Kwiatkowski, Patrick Lewis, Yuxiang Wu, Heinrich Küttler, Linqing Liu, Pasquale Minervini, Pontus Stenetorp, Sebastian Riedel, Sohee Yang, Minjoon Seo, Gautier Izacard, Fabio Petroni, Lucas Hosseini, et al. (28 additional authors not shown)

    Abstract: We review the EfficientQA competition from NeurIPS 2020. The competition focused on open-domain question answering (QA), where systems take natural language questions as input and return natural language answers. The aim of the competition was to build systems that can predict correct answers while also satisfying strict on-disk memory budgets. These memory budgets were designed to encourage conte…

    Submitted 19 September, 2021; v1 submitted 31 December, 2020; originally announced January 2021.

    Comments: 26 pages; Published in Proceedings of Machine Learning Research (PMLR), NeurIPS 2020 Competition and Demonstration Track

  15. arXiv:2012.15156  [pdf, other]

    cs.CL

    A Memory Efficient Baseline for Open Domain Question Answering

    Authors: Gautier Izacard, Fabio Petroni, Lucas Hosseini, Nicola De Cao, Sebastian Riedel, Edouard Grave

    Abstract: Recently, retrieval systems based on dense representations have led to important improvements in open-domain question answering, and related tasks. While very effective, this approach is also memory intensive, as the dense vectors for the whole knowledge source need to be kept in memory. In this paper, we study how the memory footprint of dense retriever-reader systems can be reduced. We consider…

    Submitted 30 December, 2020; originally announced December 2020.

  16. arXiv:2012.04584  [pdf, ps, other]

    cs.CL cs.LG

    Distilling Knowledge from Reader to Retriever for Question Answering

    Authors: Gautier Izacard, Edouard Grave

    Abstract: The task of information retrieval is an important component of many natural language processing systems, such as open domain question answering. While traditional methods were based on hand-crafted features, continuous representations based on neural networks recently obtained competitive results. A challenge of using such methods is to obtain supervised data to train the retriever model, correspo…

    Submitted 4 August, 2022; v1 submitted 8 December, 2020; originally announced December 2020.

  17. arXiv:2010.11125  [pdf, other]

    cs.CL cs.LG

    Beyond English-Centric Multilingual Machine Translation

    Authors: Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin

    Abstract: Existing work in translation demonstrated the potential of massively multilingual machine translation by training a single model able to translate between any pair of languages. However, much of this work is English-Centric by training only on data which was translated from or to English. While this is supported by large sources of training data, it does not reflect translation needs worldwide. In…

    Submitted 21 October, 2020; originally announced October 2020.

  18. arXiv:2010.02194  [pdf, other]

    cs.CL

    Self-training Improves Pre-training for Natural Language Understanding

    Authors: Jingfei Du, Edouard Grave, Beliz Gunel, Vishrav Chaudhary, Onur Celebi, Michael Auli, Ves Stoyanov, Alexis Conneau

    Abstract: Unsupervised pre-training has led to much recent progress in natural language understanding. In this paper, we study self-training as another way to leverage unlabeled data through semi-supervised learning. To obtain additional data for a specific task, we introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data to retrieve sentences from a…

    Submitted 5 October, 2020; originally announced October 2020.

    Comments: 8 pages

  19. arXiv:2007.01282  [pdf, other]

    cs.CL cs.LG

    Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering

    Authors: Gautier Izacard, Edouard Grave

    Abstract: Generative models for open domain question answering have proven to be competitive, without resorting to external knowledge. While promising, this approach requires to use models with billions of parameters, which are expensive to train and query. In this paper, we investigate how much these models can benefit from retrieving text passages, potentially containing evidence. We obtain state-of-the-a…

    Submitted 3 February, 2021; v1 submitted 2 July, 2020; originally announced July 2020.

  20. arXiv:2004.07320  [pdf, other]

    cs.LG stat.ML

    Training with Quantization Noise for Extreme Model Compression

    Authors: Angela Fan, Pierre Stock, Benjamin Graham, Edouard Grave, Remi Gribonval, Herve Jegou, Armand Joulin

    Abstract: We tackle the problem of producing compact models, maximizing their accuracy for a given model size. A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator. In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression…

    Submitted 28 February, 2021; v1 submitted 15 April, 2020; originally announced April 2020.

  21. arXiv:2002.09402  [pdf, other]

    cs.LG cs.CL stat.ML

    Addressing Some Limitations of Transformers with Feedback Memory

    Authors: Angela Fan, Thibaut Lavril, Edouard Grave, Armand Joulin, Sainbayar Sukhbaatar

    Abstract: Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks. Unlike recurrent neural networks, Transformers use attention to capture temporal relations while processing input tokens in parallel. While this parallelization makes them computationally efficient, it restricts the model from fully exploiting the sequential nature of the input. The…

    Submitted 25 January, 2021; v1 submitted 21 February, 2020; originally announced February 2020.

  22. arXiv:1911.08460  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures

    Authors: Gabriel Synnaeve, Qiantong Xu, Jacob Kahn, Tatiana Likhomanenko, Edouard Grave, Vineel Pratap, Anuroop Sriram, Vitaliy Liptchinsky, Ronan Collobert

    Abstract: We study pseudo-labeling for the semi-supervised training of ResNet, Time-Depth Separable ConvNets, and Transformers for speech recognition, with either CTC or Seq2Seq loss functions. We perform experiments on the standard LibriSpeech dataset, and leverage additional unlabeled data from LibriVox through pseudo-labeling. We show that while Transformer-based acoustic models have superior performance…

    Submitted 14 July, 2020; v1 submitted 19 November, 2019; originally announced November 2019.

    Comments: Published at the workshop on Self-supervision in Audio and Speech (SAS) at the 37th International Conference on Machine Learning (ICML 2020), Vienna, Austria

  23. arXiv:1911.04944  [pdf, other]

    cs.CL

    CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB

    Authors: Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, Armand Joulin

    Abstract: We show that margin-based bitext mining in a multilingual sentence space can be applied to monolingual corpora of billions of sentences. We are using ten snapshots of a curated common crawl corpus (Wenzek et al., 2019) totalling 32.7 billion unique sentences. Using one unified approach for 38 languages, we were able to mine 4.5 billions parallel sentences, out of which 661 million are aligned with…

    Submitted 1 May, 2020; v1 submitted 10 November, 2019; originally announced November 2019.

    Comments: 13 pages, 4 figures. arXiv admin note: text overlap with arXiv:1907.05791

  24. arXiv:1911.02116  [pdf, other]

    cs.CL

    Unsupervised Cross-lingual Representation Learning at Scale

    Authors: Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov

    Abstract: This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lin…

    Submitted 7 April, 2020; v1 submitted 5 November, 2019; originally announced November 2019.

    Comments: ACL 2020 (+ updated results)

  25. arXiv:1911.00359  [pdf, other]

    cs.CL cs.IR cs.LG stat.ML

    CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data

    Authors: Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, Edouard Grave

    Abstract: Pre-training text representations have led to significant improvements in many areas of natural language processing. The quality of these models benefits greatly from the size of the pretraining corpora as long as its quality is preserved. In this paper, we describe an automatic pipeline to extract massive high-quality monolingual datasets from Common Crawl for a variety of languages. Our pipeline…

    Submitted 14 November, 2019; v1 submitted 1 November, 2019; originally announced November 2019.

  26. arXiv:1910.10073  [pdf, other]

    cs.CL cs.LG

    Depth-Adaptive Transformer

    Authors: Maha Elbayad, Jiatao Gu, Edouard Grave, Michael Auli

    Abstract: State of the art sequence-to-sequence models for large scale tasks perform a fixed number of computations for each input sequence regardless of whether it is easy or hard to process. In this paper, we train Transformer models which can make output predictions at different stages of the network and we investigate different ways to predict how much computation is required for a particular sequence.…

    Submitted 14 February, 2020; v1 submitted 22 October, 2019; originally announced October 2019.

    Comments: Published as a conference paper at ICLR 2020

  27. arXiv:1910.06241  [pdf, ps, other]

    cs.CL

    Updating Pre-trained Word Vectors and Text Classifiers using Monolingual Alignment

    Authors: Piotr Bojanowski, Onur Celebi, Tomas Mikolov, Edouard Grave, Armand Joulin

    Abstract: In this paper, we focus on the problem of adapting word vector-based models to new textual data. Given a model pre-trained on large reference data, how can we adapt it to a smaller piece of data with a slightly different language distribution? We frame the adaptation problem as a monolingual word vector alignment problem, and simply average models after alignment. We align vectors using the RCSLS…

    Submitted 15 October, 2019; v1 submitted 14 October, 2019; originally announced October 2019.

  28. arXiv:1909.11556  [pdf, other]

    cs.LG cs.CL stat.ML

    Reducing Transformer Depth on Demand with Structured Dropout

    Authors: Angela Fan, Edouard Grave, Armand Joulin

    Abstract: Overparameterized transformer networks have obtained state of the art results in various natural language processing tasks, such as machine translation, language modeling, and question answering. These models contain hundreds of millions of parameters, necessitating a large amount of computation and making them prone to overfitting. In this work, we explore LayerDrop, a form of structured dropout,…

    Submitted 25 September, 2019; originally announced September 2019.

  29. arXiv:1909.02855  [pdf, other]

    cs.CL

    Don't Forget the Long Tail! A Comprehensive Analysis of Morphological Generalization in Bilingual Lexicon Induction

    Authors: Paula Czarnowska, Sebastian Ruder, Edouard Grave, Ryan Cotterell, Ann Copestake

    Abstract: Human translators routinely have to translate rare inflections of words - due to the Zipfian distribution of words in a language. When translating from Spanish, a good translator would have no problem identifying the proper translation of a statistically rare inflection such as habláramos. Note the lexeme itself, hablar, is relatively common. In this work, we investigate whether state-of-the-art b…

    Submitted 22 October, 2019; v1 submitted 6 September, 2019; originally announced September 2019.

    Comments: EMNLP 2019

  30. arXiv:1907.01470  [pdf, other]

    cs.LG cs.CL stat.ML

    Augmenting Self-attention with Persistent Memory

    Authors: Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Herve Jegou, Armand Joulin

    Abstract: Transformer networks have led to important progress in language modeling and machine translation. These models include two consecutive modules, a feed-forward layer and a self-attention layer. The latter allows the network to capture long term dependencies and are often regarded as the key ingredient in the success of Transformers. Building upon this intuition, we propose a new model that solely…

    Submitted 2 July, 2019; originally announced July 2019.

  31. arXiv:1905.09755  [pdf, other]

    cs.CL cs.LG

    Misspelling Oblivious Word Embeddings

    Authors: Bora Edizel, Aleksandra Piktus, Piotr Bojanowski, Rui Ferreira, Edouard Grave, Fabrizio Silvestri

    Abstract: In this paper we present a method to learn word embeddings that are resilient to misspellings. Existing word embeddings have limited applicability to malformed texts, which contain a non-negligible amount of out-of-vocabulary words. We propose a method combining FastText with subwords and a supervised task of learning misspelling patterns. In our method, misspellings of each word are embedded clos…

    Submitted 23 May, 2019; originally announced May 2019.

    Comments: 9 Pages

  32. arXiv:1905.07799  [pdf, other]

    cs.LG stat.ML

    Adaptive Attention Span in Transformers

    Authors: Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, Armand Joulin

    Abstract: We propose a novel self-attention mechanism that can learn its optimal attention span. This allows us to extend significantly the maximum context size used in Transformer, while maintaining control over their memory footprint and computational time. We show the effectiveness of our approach on the task of character level language modeling, where we achieve state-of-the-art performances on text8 an…

    Submitted 8 August, 2019; v1 submitted 19 May, 2019; originally announced May 2019.

    Comments: Accepted to ACL 2019

  33. arXiv:1812.10860  [pdf, other]

    cs.CL

    Can You Tell Me How to Get Past Sesame Street? Sentence-Level Pretraining Beyond Language Modeling

    Authors: Alex Wang, Jan Hula, Patrick Xia, Raghavendra Pappagari, R. Thomas McCoy, Roma Patel, Najoung Kim, Ian Tenney, Yinghui Huang, Katherin Yu, Shuning Jin, Berlin Chen, Benjamin Van Durme, Edouard Grave, Ellie Pavlick, Samuel R. Bowman

    Abstract: Natural language understanding has recently seen a surge of progress with the use of sentence encoders like ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2019) which are pretrained on variants of language modeling. We conduct the first large-scale systematic study of candidate pretraining tasks, comparing 19 different tasks both as alternatives and complements to language modeling. Our prim…

    Submitted 22 July, 2019; v1 submitted 27 December, 2018; originally announced December 2018.

    Comments: ACL 2019. This paper supersedes "Looking for ELMo's Friends: Sentence-Level Pretraining Beyond Language Modeling", an earlier version of this work by the same authors

  34. arXiv:1811.01124  [pdf, other]

    cs.CL cs.LG

    Unsupervised Hyperalignment for Multilingual Word Embeddings

    Authors: Jean Alaux, Edouard Grave, Marco Cuturi, Armand Joulin

    Abstract: We consider the problem of aligning continuous word representations, learned in multiple languages, to a common space. It was recently shown that, in the case of two languages, it is possible to learn such a mapping without supervision. This paper extends this line of work to the problem of aligning multiple languages to a common space. A solution is to independently map all languages to a pivot l…

    Submitted 4 June, 2019; v1 submitted 2 November, 2018; originally announced November 2018.

    Comments: ICLR 2019

  35. arXiv:1805.11222  [pdf, other]

    cs.LG cs.CL stat.ML

    Unsupervised Alignment of Embeddings with Wasserstein Procrustes

    Authors: Edouard Grave, Armand Joulin, Quentin Berthet

    Abstract: We consider the task of aligning two sets of points in high dimension, which has many applications in natural language processing and computer vision. As an example, it was recently shown that it is possible to infer a bilingual lexicon, without supervised data, by aligning word embeddings trained on monolingual data. These recent advances are based on adversarial training to learn the mapping bet…

    Submitted 28 May, 2018; originally announced May 2018.

  36. arXiv:1804.07745  [pdf, other]

    cs.CL cs.LG

    Loss in Translation: Learning Bilingual Word Mapping with a Retrieval Criterion

    Authors: Armand Joulin, Piotr Bojanowski, Tomas Mikolov, Herve Jegou, Edouard Grave

    Abstract: Continuous word representations learned separately on distinct languages can be aligned so that their words become comparable in a common space. Existing works typically solve a least-square regression problem to learn a rotation aligning a small bilingual lexicon, and use a retrieval criterion for inference. In this paper, we propose a unified formulation that directly optimizes a retrieval crit…

    Submitted 5 September, 2018; v1 submitted 20 April, 2018; originally announced April 2018.

  37. arXiv:1804.07705  [pdf, other]

    cs.CL

    Lightweight Adaptive Mixture of Neural and N-gram Language Models

    Authors: Anton Bakhtin, Arthur Szlam, Marc'Aurelio Ranzato, Edouard Grave

    Abstract: It is often the case that the best performing language model is an ensemble of a neural language model with n-grams. In this work, we propose a method to improve how these two models are combined. By using a small network which predicts the mixture weight between the two models, we adapt their relative importance at each time step. Because the gating network is small, it trains quickly on small am…

    Submitted 26 October, 2018; v1 submitted 20 April, 2018; originally announced April 2018.

  38. arXiv:1803.11138  [pdf, other]

    cs.CL

    Colorless green recurrent networks dream hierarchically

    Authors: Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, Marco Baroni

    Abstract: Recurrent neural networks (RNNs) have achieved impressive results in a variety of linguistic processing tasks, suggesting that they can induce non-trivial properties of language. We investigate here to what extent RNNs learn to track abstract hierarchical syntactic structure. We test whether RNNs trained with a generic language modeling objective in four languages (Italian, English, Hebrew, Russia…

    Submitted 29 March, 2018; originally announced March 2018.

    Comments: Accepted to NAACL 2018

  39. arXiv:1802.06893  [pdf, ps, other]

    cs.CL cs.LG

    Learning Word Vectors for 157 Languages

    Authors: Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, Tomas Mikolov

    Abstract: Distributed word representations, or word vectors, have recently been applied to many tasks in natural language processing, leading to state-of-the-art performance. A key ingredient to the successful application of these representations is to train them on very large corpora, and use these pre-trained models in downstream tasks. In this paper, we describe how we trained such high quality word repr…

    Submitted 28 March, 2018; v1 submitted 19 February, 2018; originally announced February 2018.

    Comments: Accepted to LREC

  40. arXiv:1802.02892  [pdf, ps, other]

    cs.CL cs.AI cs.CV

    Efficient Large-Scale Multi-Modal Classification

    Authors: D. Kiela, E. Grave, A. Joulin, T. Mikolov

    Abstract: While the incipient internet was largely text-based, the modern digital world is becoming increasingly multi-modal. Here, we examine multi-modal classification where one modality is discrete, e.g. text, and the other is continuous, e.g. visual representations transferred from a convolutional neural network. In particular, we focus on scenarios where we have to be able to classify large quantities…

    Submitted 6 February, 2018; originally announced February 2018.

    Comments: Published at AAAI-18, 7 pages

  41. arXiv:1712.09405  [pdf, ps, other]

    cs.CL

    Advances in Pre-Training Distributed Word Representations

    Authors: Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, Armand Joulin

    Abstract: Many Natural Language Processing applications nowadays rely on pre-trained word representations estimated from large text corpora such as news collections, Wikipedia and Web Crawl. In this paper, we show how to train high-quality word vector representations by using a combination of known tricks that are however rarely used together. The main result of our work is the new set of publicly available…

    Submitted 26 December, 2017; originally announced December 2017.

  42. arXiv:1711.02604  [pdf, other]

    cs.LG cs.CL

    Unbounded cache model for online language modeling with open vocabulary

    Authors: Edouard Grave, Moustapha Cisse, Armand Joulin

    Abstract: Recently, continuous cache models were proposed as extensions to recurrent neural network language models, to adapt their predictions to local changes in the data distribution. These models only capture the local context, of up to a few thousand tokens. In this paper, we propose an extension of continuous cache models, which can scale to larger contexts. In particular, we use a large scale non-pa…

    Submitted 7 November, 2017; originally announced November 2017.

    Comments: Accepted to NIPS 2017

  43. arXiv:1710.10881  [pdf, ps, other]

    stat.ML cs.LG

    Fast Linear Model for Knowledge Graph Embeddings

    Authors: Armand Joulin, Edouard Grave, Piotr Bojanowski, Maximilian Nickel, Tomas Mikolov

    Abstract: This paper shows that a simple baseline based on a Bag-of-Words (BoW) representation learns surprisingly good knowledge graph embeddings. By casting knowledge base completion and question answering as supervised classification problems, we observe that modeling co-occurrences of entities and relations leads to state-of-the-art performance with a training time of a few minutes using the open sourced…

    Submitted 30 October, 2017; originally announced October 2017.

    Comments: Submitted AKBC 2017

  44. arXiv:1704.08847  [pdf, other]

    stat.ML cs.AI cs.CR cs.LG

    Parseval Networks: Improving Robustness to Adversarial Examples

    Authors: Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, Nicolas Usunier

    Abstract: We introduce Parseval networks, a form of deep neural networks in which the Lipschitz constant of linear, convolutional and aggregation layers is constrained to be smaller than 1. Parseval networks are empirically and theoretically motivated by an analysis of the robustness of the predictions made by deep neural networks when their input is subject to an adversarial perturbation. The most importan…

    Submitted 1 May, 2017; v1 submitted 28 April, 2017; originally announced April 2017.

    Comments: submitted

  45. arXiv:1612.04426  [pdf, other]

    cs.CL cs.LG

    Improving Neural Language Models with a Continuous Cache

    Authors: Edouard Grave, Armand Joulin, Nicolas Usunier

    Abstract: We propose an extension to neural network language models to adapt their prediction to the recent history. Our model is a simplified version of memory augmented networks, which stores past hidden activations as memory and accesses them through a dot product with the current hidden activation. This mechanism is very efficient and scales to very large memory sizes. We also draw a link between the us…

    Submitted 13 December, 2016; originally announced December 2016.

    Comments: Submitted to ICLR 2017

  46. arXiv:1612.03651  [pdf, other]

    cs.CL cs.LG

    FastText.zip: Compressing text classification models

    Authors: Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hervé Jégou, Tomas Mikolov

    Abstract: We consider the problem of producing compact architectures for text classification, such that the full model fits in a limited amount of memory. After considering different solutions inspired by the hashing literature, we propose a method built upon product quantization to store word embeddings. While the original technique leads to a loss in accuracy, we adapt this method to circumvent quantizati…

    Submitted 12 December, 2016; originally announced December 2016.

    Comments: Submitted to ICLR 2017

  47. arXiv:1611.06188  [pdf, other]

    stat.ML cs.AI cs.CL cs.LG

    Variable Computation in Recurrent Neural Networks

    Authors: Yacine Jernite, Edouard Grave, Armand Joulin, Tomas Mikolov

    Abstract: Recurrent neural networks (RNNs) have been used extensively and with increasing success to model various types of sequential data. Much of this progress has been achieved through devising recurrent units and architectures with the flexibility to capture complex statistics in the data, such as long range dependency or localized attention phenomena. However, while many sequential data (such as video…

    Submitted 2 March, 2017; v1 submitted 18 November, 2016; originally announced November 2016.

  48. arXiv:1609.04309  [pdf, other]

    cs.CL cs.LG

    Efficient softmax approximation for GPUs

    Authors: Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, Hervé Jégou

    Abstract: We propose an approximate strategy to efficiently train neural network based language models over very large vocabularies. Our approach, called adaptive softmax, circumvents the linear dependency on the vocabulary size by exploiting the unbalanced word distribution to form clusters that explicitly minimize the expectation of computation time. Our approach further reduces the computational time by…

    Submitted 19 June, 2017; v1 submitted 14 September, 2016; originally announced September 2016.

    Comments: Accepted to ICML 2017

  49. arXiv:1607.04606  [pdf, other]

    cs.CL cs.LG

    Enriching Word Vectors with Subword Information

    Authors: Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov

    Abstract: Continuous word representations, trained on large unlabeled corpora, are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgra…

    Submitted 19 June, 2017; v1 submitted 15 July, 2016; originally announced July 2016.

    Comments: Accepted to TACL. The two first authors contributed equally

  50. arXiv:1607.01759  [pdf, ps, other]

    cs.CL

    Bag of Tricks for Efficient Text Classification

    Authors: Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov

    Abstract: This paper explores a simple and efficient baseline for text classification. Our experiments show that our fast text classifier fastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation. We can train fastText on more than one billion words in less than ten minutes using a standard multicore CPU, and classify half a…

    Submitted 9 August, 2016; v1 submitted 6 July, 2016; originally announced July 2016.