
Showing 1–38 of 38 results for author: Baevski, A

  1. arXiv:2310.08715  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Toward Joint Language Modeling for Speech Units and Text

    Authors: Ju-Chieh Chou, Chung-Ming Chien, Wei-Ning Hsu, Karen Livescu, Arun Babu, Alexis Conneau, Alexei Baevski, Michael Auli

    Abstract: Speech and text are two major forms of human language. The research community has been focusing on mapping speech to text or vice versa for many years. However, in the field of language modeling, very little effort has been made to model them jointly. In light of this, we explore joint language modeling for speech units and text. Specifically, we compare different speech tokenizers to transform co…

    Submitted 12 October, 2023; originally announced October 2023.

    Comments: EMNLP findings 2023

  2. arXiv:2305.13516  [pdf, other]

    cs.CL cs.SD eess.AS

    Scaling Speech Technology to 1,000+ Languages

    Authors: Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli

    Abstract: Expanding the language coverage of speech technology has the potential to improve access to information for many more people. However, current speech technology is restricted to about one hundred languages which is a small fraction of the over 7,000 languages spoken around the world. The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on…

    Submitted 22 May, 2023; originally announced May 2023.

  3. arXiv:2303.07798  [pdf, other]

    cs.CV cs.AI

    OVRL-V2: A simple state-of-art baseline for ImageNav and ObjectNav

    Authors: Karmesh Yadav, Arjun Majumdar, Ram Ramrakhya, Naoki Yokoyama, Alexei Baevski, Zsolt Kira, Oleksandr Maksymets, Dhruv Batra

    Abstract: We present a single neural network architecture composed of task-agnostic components (ViTs, convolutions, and LSTMs) that achieves state-of-art results on both the ImageNav ("go to location in <this picture>") and ObjectNav ("find a chair") tasks without any task-specific modules like object detection, segmentation, mapping, or planning modules. Such general-purpose methods offer advantages of sim…

    Submitted 14 March, 2023; originally announced March 2023.

    Comments: 15 pages, 7 figures, 9 tables

  4. arXiv:2302.06419  [pdf, other]

    eess.AS cs.AI cs.CL

    AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations

    Authors: Jiachen Lian, Alexei Baevski, Wei-Ning Hsu, Michael Auli

    Abstract: Self-supervision has shown great potential for audio-visual speech recognition by vastly reducing the amount of labeled data required to build good systems. However, existing methods are either not entirely end-to-end or do not train joint representations of both modalities. In this paper, we introduce AV-data2vec which addresses these challenges and builds audio-visual representations based on pr…

    Submitted 21 January, 2024; v1 submitted 9 February, 2023; originally announced February 2023.

    Comments: 2023 ASRU

  5. arXiv:2212.07525  [pdf, other]

    cs.LG cs.CL cs.SD eess.AS

    Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language

    Authors: Alexei Baevski, Arun Babu, Wei-Ning Hsu, Michael Auli

    Abstract: Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources. To address these issues, we increase the training efficiency of data2vec, a learning objective that generalizes across several modalities. We do not encode masked tokens, use a fast convolutional decoder and amortize the effort to build teacher representations. data2vec 2.0…

    Submitted 15 June, 2023; v1 submitted 14 December, 2022; originally announced December 2022.

  6. arXiv:2211.08402  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Introducing Semantics into Speech Encoders

    Authors: Derek Xu, Shuyan Dong, Changhan Wang, Suyoun Kim, Zhaojiang Lin, Akshat Shrivastava, Shang-Wen Li, Liang-Hsuan Tseng, Alexei Baevski, Guan-Ting Lin, Hung-yi Lee, Yizhou Sun, Wei Wang

    Abstract: Recent studies find existing self-supervised speech encoders contain primarily acoustic rather than semantic information. As a result, pipelined supervised automatic speech recognition (ASR) to large language model (LLM) systems achieve state-of-the-art results on semantic spoken language tasks by utilizing rich semantic representations from the LLM. These systems come at the cost of labeled audio…

    Submitted 15 November, 2022; originally announced November 2022.

    Comments: 11 pages, 3 figures

  7. arXiv:2207.06405  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Masked Autoencoders that Listen

    Authors: Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, Christoph Feichtenhofer

    Abstract: This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers. The decoder then re-orders and decodes the encoded conte…

    Submitted 12 January, 2023; v1 submitted 13 July, 2022; originally announced July 2022.

    Comments: Accepted at NeurIPS 2022

  8. arXiv:2206.13654  [pdf, other]

    cs.CL

    Wav2Vec-Aug: Improved self-supervised training with limited data

    Authors: Anuroop Sriram, Michael Auli, Alexei Baevski

    Abstract: Self-supervised learning (SSL) of speech representations has received much attention over the last few years but most work has focused on languages and domains with an abundance of unlabeled data. However, for many languages there is a shortage even in the unlabeled data which limits the effectiveness of SSL. In this work, we focus on the problem of applying SSL to domains with limited available d…

    Submitted 27 June, 2022; originally announced June 2022.

  9. arXiv:2204.13226  [pdf, other]

    cs.CV cs.LG

    Offline Visual Representation Learning for Embodied Navigation

    Authors: Karmesh Yadav, Ram Ramrakhya, Arjun Majumdar, Vincent-Pierre Berges, Sachit Kuhar, Dhruv Batra, Alexei Baevski, Oleksandr Maksymets

    Abstract: How should we learn visual representations for embodied agents that must see and move? The status quo is tabula rasa in vivo, i.e. learning visual representations from scratch while also learning to move, potentially augmented with auxiliary tasks (e.g. predicting the action taken between two successive observations). In this paper, we show that an alternative 2-stage strategy is far more effectiv…

    Submitted 27 April, 2022; originally announced April 2022.

    Comments: 15 pages, 4 figures, 7 tables and supplementary

  10. arXiv:2204.11934  [pdf, other]

    cs.LG cs.SD eess.AS

    On-demand compute reduction with stochastic wav2vec 2.0

    Authors: Apoorv Vyas, Wei-Ning Hsu, Michael Auli, Alexei Baevski

    Abstract: Squeeze and Efficient Wav2vec (SEW) is a recently proposed architecture that squeezes the input to the transformer encoder for compute efficient pre-training and inference with wav2vec 2.0 (W2V2) models. In this work, we propose stochastic compression for on-demand compute reduction for W2V2 models. As opposed to using a fixed squeeze factor, we sample it uniformly during training. We further intr…

    Submitted 25 April, 2022; originally announced April 2022.

    Comments: submitted to Interspeech, 2022

  11. arXiv:2204.05409  [pdf, other]

    cs.CL

    Unified Speech-Text Pre-training for Speech Translation and Recognition

    Authors: Yun Tang, Hongyu Gong, Ning Dong, Changhan Wang, Wei-Ning Hsu, Jiatao Gu, Alexei Baevski, Xian Li, Abdelrahman Mohamed, Michael Auli, Juan Pino

    Abstract: We describe a method to jointly pre-train speech and text in an encoder-decoder modeling framework for speech translation and recognition. The proposed method incorporates four self-supervised and supervised subtasks for cross modality learning. A self-supervised speech subtask leverages unlabelled speech data, and a (self-)supervised text to text subtask makes use of abundant text training data.…

    Submitted 11 April, 2022; originally announced April 2022.

    Comments: ACL 2022 main conference

  12. arXiv:2204.02524  [pdf, other]

    cs.SD cs.CL eess.AS

    Simple and Effective Unsupervised Speech Synthesis

    Authors: Alexander H. Liu, Cheng-I Jeff Lai, Wei-Ning Hsu, Michael Auli, Alexei Baevski, James Glass

    Abstract: We introduce the first unsupervised speech synthesis system based on a simple, yet effective recipe. The framework leverages recent work in unsupervised speech recognition as well as existing neural-based speech synthesis. Using only unlabeled speech audio and unlabeled text as well as a lexicon, our method enables speech synthesis without the need for a human-labeled corpus. Experiments demonstra…

    Submitted 20 April, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

    Comments: preprint, equal contribution from first two authors

  13. arXiv:2204.02492  [pdf, other]

    cs.CL cs.SD eess.AS

    Towards End-to-end Unsupervised Speech Recognition

    Authors: Alexander H. Liu, Wei-Ning Hsu, Michael Auli, Alexei Baevski

    Abstract: Unsupervised speech recognition has shown great potential to make Automatic Speech Recognition (ASR) systems accessible to every language. However, existing methods still heavily rely on hand-crafted pre-processing. Similar to the trend of making supervised speech recognition end-to-end, we introduce wav2vec-U 2.0 which does away with all audio-side pre-processing and improves accuracy through bet…

    Submitted 15 June, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

    Comments: Preprint

  14. arXiv:2203.00648  [pdf, other]

    cs.CL cs.SD eess.AS

    Measuring the Impact of Individual Domain Factors in Self-Supervised Pre-Training

    Authors: Ramon Sanabria, Wei-Ning Hsu, Alexei Baevski, Michael Auli

    Abstract: Human speech data comprises a rich set of domain factors such as accent, syntactic and semantic variety, or acoustic environment. Previous work explores the effect of domain mismatch in automatic speech recognition between pre-training and fine-tuning as a whole but does not dissect the contribution of individual factors. In this paper, we present a controlled study to better understand the effect…

    Submitted 11 June, 2023; v1 submitted 1 March, 2022; originally announced March 2022.

    Comments: Accepted to IEEE ICASSP SASB 2023

  15. arXiv:2202.03555  [pdf, other]

    cs.LG

    data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

    Authors: Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli

    Abstract: While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised learning, we present data2vec, a framework that uses the same learning method for either speech, NLP or computer vision. The core idea is to predict latent repres…

    Submitted 25 October, 2022; v1 submitted 7 February, 2022; originally announced February 2022.

  16. arXiv:2111.09296  [pdf, other]

    cs.CL cs.SD eess.AS

    XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale

    Authors: Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli

    Abstract: This paper presents XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a million hours of publicly available speech audio in 128 languages, an order of magnitude more public data than the largest known prior work. Our evaluation covers a wide range of tasks, domains, data regimes and languages, b…

    Submitted 16 December, 2021; v1 submitted 17 November, 2021; originally announced November 2021.

  17. arXiv:2109.11680  [pdf, other]

    cs.CL cs.LG cs.SD

    Simple and Effective Zero-shot Cross-lingual Phoneme Recognition

    Authors: Qiantong Xu, Alexei Baevski, Michael Auli

    Abstract: Recent progress in self-training, self-supervised pretraining and unsupervised learning enabled well performing speech recognition systems without any labeled data. However, in many cases there is labeled data available for related languages which is not utilized by these methods. This paper extends previous work on zero-shot cross-lingual transfer learning by fine-tuning a multilingually pretrain…

    Submitted 23 September, 2021; originally announced September 2021.

  18. arXiv:2107.04082  [pdf, other]

    cs.CL cs.SD eess.AS

    Improved Language Identification Through Cross-Lingual Self-Supervised Learning

    Authors: Andros Tjandra, Diptanu Gon Choudhury, Frank Zhang, Kritika Singh, Alexis Conneau, Alexei Baevski, Assaf Sela, Yatharth Saraf, Michael Auli

    Abstract: Language identification greatly impacts the success of downstream tasks such as automatic speech recognition. Recently, self-supervised speech representations learned by wav2vec 2.0 have been shown to be very effective for a range of speech tasks. We extend previous self-supervised work on language identification by experimenting with pre-trained models which were learned on real-world unconstrain…

    Submitted 17 October, 2021; v1 submitted 8 July, 2021; originally announced July 2021.

  19. arXiv:2105.11084  [pdf, other]

    cs.CL cs.SD eess.AS

    Unsupervised Speech Recognition

    Authors: Alexei Baevski, Wei-Ning Hsu, Alexis Conneau, Michael Auli

    Abstract: Despite rapid progress in the recent past, current speech recognition systems still require labeled training data which limits this technology to a small fraction of the languages spoken around the globe. This paper describes wav2vec-U, short for wav2vec Unsupervised, a method to train speech recognition models without any labeled data. We leverage self-supervised speech representations to segment…

    Submitted 2 May, 2022; v1 submitted 24 May, 2021; originally announced May 2021.

  20. arXiv:2104.06678  [pdf, ps, other]

    cs.CL

    Large-Scale Self- and Semi-Supervised Learning for Speech Translation

    Authors: Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau

    Abstract: In this paper, we improve speech translation (ST) through effectively leveraging large quantities of unlabeled speech and text data in different and complementary ways. We explore both pretraining and self-training by using the large Libri-Light speech audio corpus and language modeling with CommonCrawl. Our experiments improve over the previous state of the art by 2.6 BLEU on average on all four…

    Submitted 14 April, 2021; originally announced April 2021.

  21. arXiv:2104.01027  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training

    Authors: Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve, Michael Auli

    Abstract: Self-supervised learning of speech representations has been a very active research area but most work is focused on a single domain such as read audio books for which there exist large quantities of labeled and unlabeled data. In this paper, we explore more general setups where the domain of the unlabeled data for pre-training data differs from the domain of the labeled data for fine-tuning, which…

    Submitted 8 September, 2021; v1 submitted 2 April, 2021; originally announced April 2021.

  22. arXiv:2102.01192  [pdf, other]

    cs.CL

    Generative Spoken Language Modeling from Raw Audio

    Authors: Kushal Lakhotia, Evgeny Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, Emmanuel Dupoux

    Abstract: We introduce Generative Spoken Language Modeling, the task of learning the acoustic and linguistic characteristics of a language from raw audio (no text, no labels), and a set of metrics to automatically evaluate the learned representations at acoustic and linguistic levels for both encoding and generation. We set up baseline systems consisting of a discrete speech encoder (returning pseudo-text u…

    Submitted 9 September, 2021; v1 submitted 1 February, 2021; originally announced February 2021.

  23. arXiv:2012.15045  [pdf, other]

    cs.CL

    Reservoir Transformers

    Authors: Sheng Shen, Alexei Baevski, Ari S. Morcos, Kurt Keutzer, Michael Auli, Douwe Kiela

    Abstract: We demonstrate that transformers obtain impressive performance even when some of the layers are randomly initialized and never updated. Inspired by old and well-established ideas in machine learning, we explore a variety of non-linear "reservoir" layers interspersed with regular transformer layers, and show improvements in wall-clock compute time until convergence, as well as overall performance,…

    Submitted 1 June, 2021; v1 submitted 30 December, 2020; originally announced December 2020.

    Comments: ACL 2021

  24. arXiv:2011.11588  [pdf, other]

    cs.CL cs.SD eess.AS

    The Zero Resource Speech Benchmark 2021: Metrics and baselines for unsupervised spoken language modeling

    Authors: Tu Anh Nguyen, Maureen de Seyssel, Patricia Rozé, Morgane Rivière, Evgeny Kharitonov, Alexei Baevski, Ewan Dunbar, Emmanuel Dupoux

    Abstract: We introduce a new unsupervised task, spoken language modeling: the learning of linguistic representations from raw audio signals without any labels, along with the Zero Resource Speech Benchmark 2021: a suite of 4 black-box, zero-shot metrics probing for the quality of the learned models at 4 linguistic levels: phonetics, lexicon, syntax and semantics. We present the results and analyses of a com…

    Submitted 1 December, 2020; v1 submitted 23 November, 2020; originally announced November 2020.

    Comments: 14 pages, including references and supplementary material

  25. arXiv:2010.14230  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD

    A Comparison of Discrete Latent Variable Models for Speech Representation Learning

    Authors: Henry Zhou, Alexei Baevski, Michael Auli

    Abstract: Neural latent variable models enable the discovery of interesting structure in speech audio data. This paper presents a comparison of two different approaches which are broadly based on predicting future time-steps or auto-encoding the input signal. Our study compares the representations learned by vq-vae and vq-wav2vec in terms of sub-word unit discovery and phoneme recognition performance. Resul…

    Submitted 23 October, 2020; originally announced October 2020.

    Comments: 7 pages, 4 figures

  26. arXiv:2010.12829  [pdf, other]

    cs.CL

    Multilingual Speech Translation with Efficient Finetuning of Pretrained Models

    Authors: Xian Li, Changhan Wang, Yun Tang, Chau Tran, Yuqing Tang, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli

    Abstract: We present a simple yet effective approach to build multilingual speech-to-text (ST) translation by efficient transfer learning from pretrained speech encoder and text decoder. Our key finding is that a minimalistic LNA (LayerNorm and Attention) finetuning can achieve zero-shot crosslingual and cross-modality transfer ability by only finetuning less than 10% of the pretrained parameters. This enab…

    Submitted 2 January, 2021; v1 submitted 24 October, 2020; originally announced October 2020.

  27. arXiv:2010.11430  [pdf, other]

    cs.LG cs.SD eess.AS

    Self-training and Pre-training are Complementary for Speech Recognition

    Authors: Qiantong Xu, Alexei Baevski, Tatiana Likhomanenko, Paden Tomasello, Alexis Conneau, Ronan Collobert, Gabriel Synnaeve, Michael Auli

    Abstract: Self-training and unsupervised pre-training have emerged as effective approaches to improve speech recognition systems using unlabeled data. However, it is not clear whether they learn similar patterns or if they can be effectively combined. In this paper, we show that pseudo-labeling and pre-training with wav2vec 2.0 are complementary in a variety of labeled data setups. Using just 10 minutes of…

    Submitted 22 October, 2020; originally announced October 2020.

  28. arXiv:2006.13979  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Unsupervised Cross-lingual Representation Learning for Speech Recognition

    Authors: Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli

    Abstract: This paper presents XLSR which learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages. We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations and jointly learns a quantization of the latents shared across languages. The resulting model is fine-tuned on labeled data and…

    Submitted 15 December, 2020; v1 submitted 24 June, 2020; originally announced June 2020.

  29. arXiv:2006.11477  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

    Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli

    Abstract: We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments…

    Submitted 22 October, 2020; v1 submitted 19 June, 2020; originally announced June 2020.

  30. arXiv:1911.03912  [pdf, other]

    cs.CL cs.LG

    Effectiveness of self-supervised pre-training for speech recognition

    Authors: Alexei Baevski, Michael Auli, Abdelrahman Mohamed

    Abstract: We compare self-supervised representation learning algorithms which either explicitly quantize the audio data or learn representations without quantization. We find the former to be more accurate since it builds a good vocabulary of the data through vq-wav2vec [1] to enable learning of effective representations in subsequent BERT training. Different to previous work, we directly fine-tune the pre-…

    Submitted 18 May, 2020; v1 submitted 10 November, 2019; originally announced November 2019.

  31. arXiv:1910.05453  [pdf, other]

    cs.CL cs.LG

    vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations

    Authors: Alexei Baevski, Steffen Schneider, Michael Auli

    Abstract: We propose vq-wav2vec to learn discrete representations of audio segments through a wav2vec-style self-supervised context prediction task. The algorithm uses either a gumbel softmax or online k-means clustering to quantize the dense representations. Discretization enables the direct application of algorithms from the NLP community which require discrete inputs. Experiments show that BERT pre-train…

    Submitted 16 February, 2020; v1 submitted 11 October, 2019; originally announced October 2019.

  32. arXiv:1907.06616  [pdf, ps, other]

    cs.CL

    Facebook FAIR's WMT19 News Translation Task Submission

    Authors: Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, Sergey Edunov

    Abstract: This paper describes Facebook FAIR's submission to the WMT19 shared news translation task. We participate in two language pairs and four language directions, English <-> German and English <-> Russian. Following our submission from last year, our baseline systems are large BPE-based transformer models trained with the Fairseq sequence modeling toolkit which rely on sampled back-translations. This…

    Submitted 15 July, 2019; originally announced July 2019.

    Comments: 7 pages; WMT

  33. arXiv:1904.05862  [pdf, other]

    cs.CL

    wav2vec: Unsupervised Pre-training for Speech Recognition

    Authors: Steffen Schneider, Alexei Baevski, Ronan Collobert, Michael Auli

    Abstract: We explore unsupervised pre-training for speech recognition by learning representations of raw audio. wav2vec is trained on large amounts of unlabeled audio data and the resulting representations are then used to improve acoustic model training. We pre-train a simple multi-layer convolutional neural network optimized via a noise contrastive binary classification task. Our experiments on WSJ reduce…

    Submitted 11 September, 2019; v1 submitted 11 April, 2019; originally announced April 2019.

  34. arXiv:1904.01038  [pdf, other]

    cs.CL

    fairseq: A Fast, Extensible Toolkit for Sequence Modeling

    Authors: Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, Michael Auli

    Abstract: fairseq is an open-source sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks. The toolkit is based on PyTorch and supports distributed training across multiple GPUs and machines. We also support fast mixed-precision training and inference on modern GPUs. A demo video can be found…

    Submitted 1 April, 2019; originally announced April 2019.

    Comments: NAACL 2019 Demo paper

  35. arXiv:1903.09722  [pdf, ps, other]

    cs.CL

    Pre-trained Language Model Representations for Language Generation

    Authors: Sergey Edunov, Alexei Baevski, Michael Auli

    Abstract: Pre-trained language model representations have been successful in a wide range of language understanding tasks. In this paper, we examine different strategies to integrate pre-trained representations into sequence to sequence models and apply it to neural machine translation and abstractive summarization. We find that pre-trained representations are most effective when added to the encoder networ…

    Submitted 1 April, 2019; v1 submitted 22 March, 2019; originally announced March 2019.

    Comments: NAACL 2019

  36. arXiv:1903.07785  [pdf, other]

    cs.CL

    Cloze-driven Pretraining of Self-attention Networks

    Authors: Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, Michael Auli

    Abstract: We present a new approach for pretraining a bi-directional transformer model that provides significant performance gains across a variety of language understanding problems. Our model solves a cloze-style word reconstruction task, where each word is ablated and must be predicted given the rest of the text. Experiments demonstrate large performance gains on GLUE and new state of the art results on…

    Submitted 18 March, 2019; originally announced March 2019.

  37. arXiv:1901.10430  [pdf, other]

    cs.CL

    Pay Less Attention with Lightweight and Dynamic Convolutions

    Authors: Felix Wu, Angela Fan, Alexei Baevski, Yann N. Dauphin, Michael Auli

    Abstract: Self-attention is a useful mechanism to build generative models for language and images. It determines the importance of context elements by comparing each element to the current time step. In this paper, we show that a very lightweight convolution can perform competitively to the best reported self-attention results. Next, we introduce dynamic convolutions which are simpler and more efficient tha…

    Submitted 22 February, 2019; v1 submitted 29 January, 2019; originally announced January 2019.

    Comments: 14 pages, ICLR oral

  38. arXiv:1809.10853  [pdf, other]

    cs.CL

    Adaptive Input Representations for Neural Language Modeling

    Authors: Alexei Baevski, Michael Auli

    Abstract: We introduce adaptive input representations for neural language modeling which extend the adaptive softmax of Grave et al. (2017) to input representations of variable capacity. There are several choices on how to factorize the input and output layers, and whether to model words, characters or sub-word units. We perform a systematic comparison of popular choices for a self-attentional architecture.…

    Submitted 22 February, 2019; v1 submitted 28 September, 2018; originally announced September 2018.

    Comments: 12 pages