Showing 1–50 of 70 results for author: Auli, M

Searching in archive cs.
  1. arXiv:2310.08715  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Toward Joint Language Modeling for Speech Units and Text

    Authors: Ju-Chieh Chou, Chung-Ming Chien, Wei-Ning Hsu, Karen Livescu, Arun Babu, Alexis Conneau, Alexei Baevski, Michael Auli

    Abstract: Speech and text are two major forms of human language. The research community has been focusing on mapping speech to text or vice versa for many years. However, in the field of language modeling, very little effort has been made to model them jointly. In light of this, we explore joint language modeling for speech units and text. Specifically, we compare different speech tokenizers to transform co…

    Submitted 12 October, 2023; originally announced October 2023.

    Comments: EMNLP findings 2023

  2. arXiv:2305.13516  [pdf, other]

    cs.CL cs.SD eess.AS

    Scaling Speech Technology to 1,000+ Languages

    Authors: Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli

    Abstract: Expanding the language coverage of speech technology has the potential to improve access to information for many more people. However, current speech technology is restricted to about one hundred languages which is a small fraction of the over 7,000 languages spoken around the world. The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on…

    Submitted 22 May, 2023; originally announced May 2023.

  3. arXiv:2305.10005  [pdf, other]

    cs.CL

    DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning

    Authors: Alexander H. Liu, Heng-Jui Chang, Michael Auli, Wei-Ning Hsu, James R. Glass

    Abstract: In this paper, we introduce self-distillation and online clustering for self-supervised speech representation learning (DinoSR) which combines masked language modeling, self-distillation, and online clustering. We show that these concepts complement each other and result in a strong representation learning model for speech. DinoSR first extracts contextualized embeddings from the input audio with…

    Submitted 16 January, 2024; v1 submitted 17 May, 2023; originally announced May 2023.

  4. arXiv:2302.06419  [pdf, other]

    eess.AS cs.AI cs.CL

    AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations

    Authors: Jiachen Lian, Alexei Baevski, Wei-Ning Hsu, Michael Auli

    Abstract: Self-supervision has shown great potential for audio-visual speech recognition by vastly reducing the amount of labeled data required to build good systems. However, existing methods are either not entirely end-to-end or do not train joint representations of both modalities. In this paper, we introduce AV-data2vec which addresses these challenges and builds audio-visual representations based on pr…

    Submitted 21 January, 2024; v1 submitted 9 February, 2023; originally announced February 2023.

    Comments: 2023 ASRU

  5. arXiv:2212.07525  [pdf, other]

    cs.LG cs.CL cs.SD eess.AS

    Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language

    Authors: Alexei Baevski, Arun Babu, Wei-Ning Hsu, Michael Auli

    Abstract: Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources. To address these issues, we increase the training efficiency of data2vec, a learning objective that generalizes across several modalities. We do not encode masked tokens, use a fast convolutional decoder and amortize the effort to build teacher representations. data2vec 2.0…

    Submitted 15 June, 2023; v1 submitted 14 December, 2022; originally announced December 2022.

  6. arXiv:2210.10191  [pdf, other]

    cs.CL cs.SD eess.AS

    Simple and Effective Unsupervised Speech Translation

    Authors: Changhan Wang, Hirofumi Inaguma, Peng-Jen Chen, Ilia Kulikov, Yun Tang, Wei-Ning Hsu, Michael Auli, Juan Pino

    Abstract: The amount of labeled data to train models for speech tasks is limited for most languages, however, the data scarcity is exacerbated for speech translation which requires labeled data covering two different languages. To address this issue, we study a simple and effective approach to build speech translation systems without labeled data by leveraging recent advances in unsupervised speech recognit…

    Submitted 18 October, 2022; originally announced October 2022.

  7. arXiv:2207.06405  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Masked Autoencoders that Listen

    Authors: Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, Christoph Feichtenhofer

    Abstract: This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers. The decoder then re-orders and decodes the encoded conte…

    Submitted 12 January, 2023; v1 submitted 13 July, 2022; originally announced July 2022.

    Comments: Accepted at NeurIPS 2022

  8. arXiv:2206.13654  [pdf, other]

    cs.CL

    Wav2Vec-Aug: Improved self-supervised training with limited data

    Authors: Anuroop Sriram, Michael Auli, Alexei Baevski

    Abstract: Self-supervised learning (SSL) of speech representations has received much attention over the last few years but most work has focused on languages and domains with an abundance of unlabeled data. However, for many languages there is a shortage even in the unlabeled data which limits the effectiveness of SSL. In this work, we focus on the problem of applying SSL to domains with limited available d…

    Submitted 27 June, 2022; originally announced June 2022.

  9. arXiv:2204.11934  [pdf, other]

    cs.LG cs.SD eess.AS

    On-demand compute reduction with stochastic wav2vec 2.0

    Authors: Apoorv Vyas, Wei-Ning Hsu, Michael Auli, Alexei Baevski

    Abstract: Squeeze and Efficient Wav2vec (SEW) is a recently proposed architecture that squeezes the input to the transformer encoder for compute efficient pre-training and inference with wav2vec 2.0 (W2V2) models. In this work, we propose stochastic compression for on-demand compute reduction for W2V2 models. As opposed to using a fixed squeeze factor, we sample it uniformly during training. We further intr…

    Submitted 25 April, 2022; originally announced April 2022.

    Comments: submitted to Interspeech, 2022

  10. arXiv:2204.05409  [pdf, other]

    cs.CL

    Unified Speech-Text Pre-training for Speech Translation and Recognition

    Authors: Yun Tang, Hongyu Gong, Ning Dong, Changhan Wang, Wei-Ning Hsu, Jiatao Gu, Alexei Baevski, Xian Li, Abdelrahman Mohamed, Michael Auli, Juan Pino

    Abstract: We describe a method to jointly pre-train speech and text in an encoder-decoder modeling framework for speech translation and recognition. The proposed method incorporates four self-supervised and supervised subtasks for cross modality learning. A self-supervised speech subtask leverages unlabelled speech data, and a (self-)supervised text to text subtask makes use of abundant text training data.…

    Submitted 11 April, 2022; originally announced April 2022.

    Comments: ACL 2022 main conference

  11. arXiv:2204.02524  [pdf, other]

    cs.SD cs.CL eess.AS

    Simple and Effective Unsupervised Speech Synthesis

    Authors: Alexander H. Liu, Cheng-I Jeff Lai, Wei-Ning Hsu, Michael Auli, Alexei Baevski, James Glass

    Abstract: We introduce the first unsupervised speech synthesis system based on a simple, yet effective recipe. The framework leverages recent work in unsupervised speech recognition as well as existing neural-based speech synthesis. Using only unlabeled speech audio and unlabeled text as well as a lexicon, our method enables speech synthesis without the need for a human-labeled corpus. Experiments demonstra…

    Submitted 20 April, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

    Comments: preprint, equal contribution from first two authors

  12. arXiv:2204.02492  [pdf, other]

    cs.CL cs.SD eess.AS

    Towards End-to-end Unsupervised Speech Recognition

    Authors: Alexander H. Liu, Wei-Ning Hsu, Michael Auli, Alexei Baevski

    Abstract: Unsupervised speech recognition has shown great potential to make Automatic Speech Recognition (ASR) systems accessible to every language. However, existing methods still heavily rely on hand-crafted pre-processing. Similar to the trend of making supervised speech recognition end-to-end, we introduce wav2vec-U 2.0 which does away with all audio-side pre-processing and improves accuracy through bet…

    Submitted 15 June, 2022; v1 submitted 5 April, 2022; originally announced April 2022.

    Comments: Preprint

  13. arXiv:2203.10752  [pdf, other]

    cs.CL

    XTREME-S: Evaluating Cross-lingual Speech Representations

    Authors: Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, Daan Van Esch, Vera Axelrod, Simran Khanuja, Jonathan H. Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, Melvin Johnson

    Abstract: We introduce XTREME-S, a new benchmark to evaluate universal cross-lingual speech representations in many languages. XTREME-S covers four task families: speech recognition, classification, speech-to-text translation and retrieval. Covering 102 languages from 10+ language families, 3 different domains and 4 task families, XTREME-S aims to simplify multilingual speech representation evaluation, as w…

    Submitted 13 April, 2022; v1 submitted 21 March, 2022; originally announced March 2022.

    Comments: Minor fix: language code for Filipino (Tagalog), "tg" -> "tl"

  14. arXiv:2203.00648  [pdf, other]

    cs.CL cs.SD eess.AS

    Measuring the Impact of Individual Domain Factors in Self-Supervised Pre-Training

    Authors: Ramon Sanabria, Wei-Ning Hsu, Alexei Baevski, Michael Auli

    Abstract: Human speech data comprises a rich set of domain factors such as accent, syntactic and semantic variety, or acoustic environment. Previous work explores the effect of domain mismatch in automatic speech recognition between pre-training and fine-tuning as a whole but does not dissect the contribution of individual factors. In this paper, we present a controlled study to better understand the effect…

    Submitted 11 June, 2023; v1 submitted 1 March, 2022; originally announced March 2022.

    Comments: Accepted to IEEE ICASSP SASB 2023

  15. arXiv:2202.03555  [pdf, other]

    cs.LG

    data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

    Authors: Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli

    Abstract: While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised learning, we present data2vec, a framework that uses the same learning method for either speech, NLP or computer vision. The core idea is to predict latent repres…

    Submitted 25 October, 2022; v1 submitted 7 February, 2022; originally announced February 2022.

  16. arXiv:2111.09296  [pdf, other]

    cs.CL cs.SD eess.AS

    XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale

    Authors: Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli

    Abstract: This paper presents XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a million hours of publicly available speech audio in 128 languages, an order of magnitude more public data than the largest known prior work. Our evaluation covers a wide range of tasks, domains, data regimes and languages, b…

    Submitted 16 December, 2021; v1 submitted 17 November, 2021; originally announced November 2021.

  17. arXiv:2109.11680  [pdf, other]

    cs.CL cs.LG cs.SD

    Simple and Effective Zero-shot Cross-lingual Phoneme Recognition

    Authors: Qiantong Xu, Alexei Baevski, Michael Auli

    Abstract: Recent progress in self-training, self-supervised pretraining and unsupervised learning enabled well performing speech recognition systems without any labeled data. However, in many cases there is labeled data available for related languages which is not utilized by these methods. This paper extends previous work on zero-shot cross-lingual transfer learning by fine-tuning a multilingually pretrain…

    Submitted 23 September, 2021; originally announced September 2021.

  18. arXiv:2107.04082  [pdf, other]

    cs.CL cs.SD eess.AS

    Improved Language Identification Through Cross-Lingual Self-Supervised Learning

    Authors: Andros Tjandra, Diptanu Gon Choudhury, Frank Zhang, Kritika Singh, Alexis Conneau, Alexei Baevski, Assaf Sela, Yatharth Saraf, Michael Auli

    Abstract: Language identification greatly impacts the success of downstream tasks such as automatic speech recognition. Recently, self-supervised speech representations learned by wav2vec 2.0 have been shown to be very effective for a range of speech tasks. We extend previous self-supervised work on language identification by experimenting with pre-trained models which were learned on real-world unconstrain…

    Submitted 17 October, 2021; v1 submitted 8 July, 2021; originally announced July 2021.

  19. arXiv:2105.11084  [pdf, other]

    cs.CL cs.SD eess.AS

    Unsupervised Speech Recognition

    Authors: Alexei Baevski, Wei-Ning Hsu, Alexis Conneau, Michael Auli

    Abstract: Despite rapid progress in the recent past, current speech recognition systems still require labeled training data which limits this technology to a small fraction of the languages spoken around the globe. This paper describes wav2vec-U, short for wav2vec Unsupervised, a method to train speech recognition models without any labeled data. We leverage self-supervised speech representations to segment…

    Submitted 2 May, 2022; v1 submitted 24 May, 2021; originally announced May 2021.

  20. arXiv:2104.06678  [pdf, ps, other]

    cs.CL

    Large-Scale Self- and Semi-Supervised Learning for Speech Translation

    Authors: Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau

    Abstract: In this paper, we improve speech translation (ST) through effectively leveraging large quantities of unlabeled speech and text data in different and complementary ways. We explore both pretraining and self-training by using the large Libri-Light speech audio corpus and language modeling with CommonCrawl. Our experiments improve over the previous state of the art by 2.6 BLEU on average on all four…

    Submitted 14 April, 2021; originally announced April 2021.

  21. arXiv:2104.01027  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training

    Authors: Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve, Michael Auli

    Abstract: Self-supervised learning of speech representations has been a very active research area but most work is focused on a single domain such as read audio books for which there exist large quantities of labeled and unlabeled data. In this paper, we explore more general setups where the domain of the unlabeled data for pre-training data differs from the domain of the labeled data for fine-tuning, which…

    Submitted 8 September, 2021; v1 submitted 2 April, 2021; originally announced April 2021.

  22. arXiv:2101.11040  [pdf, ps, other]

    cs.CL

    A Comparison of Approaches to Document-level Machine Translation

    Authors: Zhiyi Ma, Sergey Edunov, Michael Auli

    Abstract: Document-level machine translation conditions on surrounding sentences to produce coherent translations. There has been much recent work in this area with the introduction of custom model architectures and decoding algorithms. This paper presents a systematic comparison of selected approaches from the literature on two benchmarks for which document-level phenomena evaluation suites exist. We find…

    Submitted 26 January, 2021; originally announced January 2021.

    Comments: 10 pages, 5 tables

  23. arXiv:2012.15045  [pdf, other]

    cs.CL

    Reservoir Transformers

    Authors: Sheng Shen, Alexei Baevski, Ari S. Morcos, Kurt Keutzer, Michael Auli, Douwe Kiela

    Abstract: We demonstrate that transformers obtain impressive performance even when some of the layers are randomly initialized and never updated. Inspired by old and well-established ideas in machine learning, we explore a variety of non-linear "reservoir" layers interspersed with regular transformer layers, and show improvements in wall-clock compute time until convergence, as well as overall performance,…

    Submitted 1 June, 2021; v1 submitted 30 December, 2020; originally announced December 2020.

    Comments: ACL 2021

  24. arXiv:2011.07164  [pdf, ps, other]

    cs.CL

    Language Models not just for Pre-training: Fast Online Neural Noisy Channel Modeling

    Authors: Shruti Bhosale, Kyra Yee, Sergey Edunov, Michael Auli

    Abstract: Pre-training models on vast quantities of unlabeled data has emerged as an effective approach to improving accuracy on many NLP tasks. On the other hand, traditional machine translation has a long history of leveraging unlabeled data through noisy channel modeling. The same idea has recently been shown to achieve strong improvements for neural machine translation. Unfortunately, naïve noisy channe…

    Submitted 13 November, 2020; originally announced November 2020.

    Comments: Accepted at WMT 2020

  25. arXiv:2010.14230  [pdf, other]

    eess.AS cs.AI cs.LG cs.SD

    A Comparison of Discrete Latent Variable Models for Speech Representation Learning

    Authors: Henry Zhou, Alexei Baevski, Michael Auli

    Abstract: Neural latent variable models enable the discovery of interesting structure in speech audio data. This paper presents a comparison of two different approaches which are broadly based on predicting future time-steps or auto-encoding the input signal. Our study compares the representations learned by vq-vae and vq-wav2vec in terms of sub-word unit discovery and phoneme recognition performance. Resul…

    Submitted 23 October, 2020; originally announced October 2020.

    Comments: 7 pages, 4 figures

  26. arXiv:2010.12829  [pdf, other]

    cs.CL

    Multilingual Speech Translation with Efficient Finetuning of Pretrained Models

    Authors: Xian Li, Changhan Wang, Yun Tang, Chau Tran, Yuqing Tang, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli

    Abstract: We present a simple yet effective approach to build multilingual speech-to-text (ST) translation by efficient transfer learning from pretrained speech encoder and text decoder. Our key finding is that a minimalistic LNA (LayerNorm and Attention) finetuning can achieve zero-shot crosslingual and cross-modality transfer ability by only finetuning less than 10% of the pretrained parameters. This enab…

    Submitted 2 January, 2021; v1 submitted 24 October, 2020; originally announced October 2020.

  27. arXiv:2010.11430  [pdf, other]

    cs.LG cs.SD eess.AS

    Self-training and Pre-training are Complementary for Speech Recognition

    Authors: Qiantong Xu, Alexei Baevski, Tatiana Likhomanenko, Paden Tomasello, Alexis Conneau, Ronan Collobert, Gabriel Synnaeve, Michael Auli

    Abstract: Self-training and unsupervised pre-training have emerged as effective approaches to improve speech recognition systems using unlabeled data. However, it is not clear whether they learn similar patterns or if they can be effectively combined. In this paper, we show that pseudo-labeling and pre-training with wav2vec 2.0 are complementary in a variety of labeled data setups. Using just 10 minutes of…

    Submitted 22 October, 2020; originally announced October 2020.

  28. arXiv:2010.11125  [pdf, other]

    cs.CL cs.LG

    Beyond English-Centric Multilingual Machine Translation

    Authors: Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin

    Abstract: Existing work in translation demonstrated the potential of massively multilingual machine translation by training a single model able to translate between any pair of languages. However, much of this work is English-Centric by training only on data which was translated from or to English. While this is supported by large sources of training data, it does not reflect translation needs worldwide. In…

    Submitted 21 October, 2020; originally announced October 2020.

  29. arXiv:2010.02194  [pdf, other]

    cs.CL

    Self-training Improves Pre-training for Natural Language Understanding

    Authors: Jingfei Du, Edouard Grave, Beliz Gunel, Vishrav Chaudhary, Onur Celebi, Michael Auli, Ves Stoyanov, Alexis Conneau

    Abstract: Unsupervised pre-training has led to much recent progress in natural language understanding. In this paper, we study self-training as another way to leverage unlabeled data through semi-supervised learning. To obtain additional data for a specific task, we introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data to retrieve sentences from a…

    Submitted 5 October, 2020; originally announced October 2020.

    Comments: 8 pages

  30. arXiv:2006.13979  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Unsupervised Cross-lingual Representation Learning for Speech Recognition

    Authors: Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli

    Abstract: This paper presents XLSR which learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages. We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations and jointly learns a quantization of the latents shared across languages. The resulting model is fine-tuned on labeled data and…

    Submitted 15 December, 2020; v1 submitted 24 June, 2020; originally announced June 2020.

  31. arXiv:2006.11477  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

    Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli

    Abstract: We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments…

    Submitted 22 October, 2020; v1 submitted 19 June, 2020; originally announced June 2020.

  32. arXiv:2003.10647  [pdf, other]

    cs.LG cs.CV eess.IV

    Robust and On-the-fly Dataset Denoising for Image Classification

    Authors: Jiaming Song, Lunjia Hu, Michael Auli, Yann Dauphin, Tengyu Ma

    Abstract: Memorization in over-parameterized neural networks could severely hurt generalization in the presence of mislabeled examples. However, mislabeled examples are hard to avoid in extremely large datasets collected with weak supervision. We address this problem by reasoning counterfactually about the loss distribution of examples with uniform random labels had they were trained with the real examples,…

    Submitted 9 April, 2020; v1 submitted 23 March, 2020; originally announced March 2020.

  33. arXiv:1911.09728  [pdf, other]

    cs.CL cs.LG

    Improving Conditioning in Context-Aware Sequence to Sequence Models

    Authors: Xinyi Wang, Jason Weston, Michael Auli, Yacine Jernite

    Abstract: Neural sequence to sequence models are well established for applications which can be cast as mapping a single input sequence into a single output sequence. In this work, we focus on cases where generation is conditioned on both a short query and a long context, such as abstractive question answering or document-level translation. We modify the standard sequence-to-sequence approach to make better…

    Submitted 21 November, 2019; originally announced November 2019.

  34. arXiv:1911.03912  [pdf, other]

    cs.CL cs.LG

    Effectiveness of self-supervised pre-training for speech recognition

    Authors: Alexei Baevski, Michael Auli, Abdelrahman Mohamed

    Abstract: We compare self-supervised representation learning algorithms which either explicitly quantize the audio data or learn representations without quantization. We find the former to be more accurate since it builds a good vocabulary of the data through vq-wav2vec [1] to enable learning of effective representations in subsequent BERT training. Different to previous work, we directly fine-tune the pre-…

    Submitted 18 May, 2020; v1 submitted 10 November, 2019; originally announced November 2019.

  35. arXiv:1910.10073  [pdf, other]

    cs.CL cs.LG

    Depth-Adaptive Transformer

    Authors: Maha Elbayad, Jiatao Gu, Edouard Grave, Michael Auli

    Abstract: State of the art sequence-to-sequence models for large scale tasks perform a fixed number of computations for each input sequence regardless of whether it is easy or hard to process. In this paper, we train Transformer models which can make output predictions at different stages of the network and we investigate different ways to predict how much computation is required for a particular sequence.…

    Submitted 14 February, 2020; v1 submitted 22 October, 2019; originally announced October 2019.

    Comments: Published as a conference paper at ICLR 2020

  36. arXiv:1910.05453  [pdf, other]

    cs.CL cs.LG

    vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations

    Authors: Alexei Baevski, Steffen Schneider, Michael Auli

    Abstract: We propose vq-wav2vec to learn discrete representations of audio segments through a wav2vec-style self-supervised context prediction task. The algorithm uses either a gumbel softmax or online k-means clustering to quantize the dense representations. Discretization enables the direct application of algorithms from the NLP community which require discrete inputs. Experiments show that BERT pre-train…

    Submitted 16 February, 2020; v1 submitted 11 October, 2019; originally announced October 2019.

  37. arXiv:1909.13151  [pdf, other]

    cs.CL

    The Source-Target Domain Mismatch Problem in Machine Translation

    Authors: Jiajun Shen, Peng-Jen Chen, Matt Le, Junxian He, Jiatao Gu, Myle Ott, Michael Auli, Marc'Aurelio Ranzato

    Abstract: While we live in an increasingly interconnected world, different places still exhibit strikingly different cultures and many events we experience in our every day life pertain only to the specific place we live in. As a result, people often talk about different things in different parts of the world. In this work we study the effect of local context in machine translation and postulate that partic…

    Submitted 16 June, 2020; v1 submitted 28 September, 2019; originally announced September 2019.

  38. arXiv:1908.05731  [pdf, ps, other]

    cs.CL

    Simple and Effective Noisy Channel Modeling for Neural Machine Translation

    Authors: Kyra Yee, Nathan Ng, Yann N. Dauphin, Michael Auli

    Abstract: Previous work on neural noisy channel modeling relied on latent variable models that incrementally process the source and target sentence. This makes decoding decisions based on partial source prefixes even though the full source is available. We pursue an alternative approach based on standard sequence to sequence models which utilize the entire source. These models perform remarkably well as cha…

    Submitted 15 August, 2019; originally announced August 2019.

    Comments: EMNLP 2019

  39. arXiv:1908.05204  [pdf, other]

    cs.CL

    On The Evaluation of Machine Translation Systems Trained With Back-Translation

    Authors: Sergey Edunov, Myle Ott, Marc'Aurelio Ranzato, Michael Auli

    Abstract: Back-translation is a widely used data augmentation technique which leverages target monolingual data. However, its effectiveness has been challenged since automatic metrics such as BLEU only show significant improvements for test examples where the source itself is a translation, or translationese. This is believed to be due to translationese inputs better matching the back-translated training da…

    Submitted 18 August, 2020; v1 submitted 14 August, 2019; originally announced August 2019.

    Comments: ACL 2020

  40. arXiv:1907.09190  [pdf, other]

    cs.CL

    ELI5: Long Form Question Answering

    Authors: Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, Michael Auli

    Abstract: We introduce the first large-scale corpus for long-form question answering, a task requiring elaborate and in-depth answers to open-ended questions. The dataset comprises 270K threads from the Reddit forum "Explain Like I'm Five" (ELI5) where an online community provides answers to questions which are comprehensible by five year olds. Compared to existing datasets, ELI5 comprises diverse questio…

    Submitted 22 July, 2019; originally announced July 2019.

  41. arXiv:1907.06616  [pdf, ps, other]

    cs.CL

    Facebook FAIR's WMT19 News Translation Task Submission

    Authors: Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, Sergey Edunov

    Abstract: This paper describes Facebook FAIR's submission to the WMT19 shared news translation task. We participate in two language pairs and four language directions, English <-> German and English <-> Russian. Following our submission from last year, our baseline systems are large BPE-based transformer models trained with the Fairseq sequence modeling toolkit which rely on sampled back-translations. This…

    Submitted 15 July, 2019; originally announced July 2019.

    Comments: 7 pages; WMT

  42. arXiv:1907.06385  [pdf, other]

    cs.CL cs.LG

    GLOSS: Generative Latent Optimization of Sentence Representations

    Authors: Sidak Pal Singh, Angela Fan, Michael Auli

    Abstract: We propose a method to learn unsupervised sentence representations in a non-compositional manner based on Generative Latent Optimization. Our approach does not impose any assumptions on how words are to be combined into a sentence representation. We discuss a simple Bag of Words model as well as a variant that models word positions. Both are trained to reconstruct the sentence based on a latent co…

    Submitted 15 July, 2019; originally announced July 2019.

  43. arXiv:1904.05862  [pdf, other]

    cs.CL

    wav2vec: Unsupervised Pre-training for Speech Recognition

    Authors: Steffen Schneider, Alexei Baevski, Ronan Collobert, Michael Auli

    Abstract: We explore unsupervised pre-training for speech recognition by learning representations of raw audio. wav2vec is trained on large amounts of unlabeled audio data and the resulting representations are then used to improve acoustic model training. We pre-train a simple multi-layer convolutional neural network optimized via a noise contrastive binary classification task. Our experiments on WSJ reduce…

    Submitted 11 September, 2019; v1 submitted 11 April, 2019; originally announced April 2019.

  44. arXiv:1904.01038  [pdf, other]

    cs.CL

    fairseq: A Fast, Extensible Toolkit for Sequence Modeling

    Authors: Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, Michael Auli

    Abstract: fairseq is an open-source sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks. The toolkit is based on PyTorch and supports distributed training across multiple GPUs and machines. We also support fast mixed-precision training and inference on modern GPUs. A demo video can be found…

    Submitted 1 April, 2019; originally announced April 2019.

    Comments: NAACL 2019 Demo paper

  45. arXiv:1903.09722  [pdf, ps, other]

    cs.CL

    Pre-trained Language Model Representations for Language Generation

    Authors: Sergey Edunov, Alexei Baevski, Michael Auli

    Abstract: Pre-trained language model representations have been successful in a wide range of language understanding tasks. In this paper, we examine different strategies to integrate pre-trained representations into sequence to sequence models and apply it to neural machine translation and abstractive summarization. We find that pre-trained representations are most effective when added to the encoder networ…

    Submitted 1 April, 2019; v1 submitted 22 March, 2019; originally announced March 2019.

    Comments: NAACL 2019

  46. arXiv:1903.07785  [pdf, other]

    cs.CL

    Cloze-driven Pretraining of Self-attention Networks

    Authors: Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, Michael Auli

    Abstract: We present a new approach for pretraining a bi-directional transformer model that provides significant performance gains across a variety of language understanding problems. Our model solves a cloze-style word reconstruction task, where each word is ablated and must be predicted given the rest of the text. Experiments demonstrate large performance gains on GLUE and new state of the art results on…

    Submitted 18 March, 2019; originally announced March 2019.

  47. arXiv:1902.07816  [pdf, other]

    cs.CL cs.LG

    Mixture Models for Diverse Machine Translation: Tricks of the Trade

    Authors: Tianxiao Shen, Myle Ott, Michael Auli, Marc'Aurelio Ranzato

    Abstract: Mixture models trained via EM are among the simplest, most widely used and well understood latent variable models in the machine learning literature. Surprisingly, these models have been hardly explored in text generation applications such as machine translation. In principle, they provide a latent variable to control generation and produce a diverse set of hypotheses. In practice, however, mixtur…

    Submitted 24 May, 2019; v1 submitted 20 February, 2019; originally announced February 2019.

    Comments: ICML 2019 camera-ready

  48. arXiv:1901.10430  [pdf, other]

    cs.CL

    Pay Less Attention with Lightweight and Dynamic Convolutions

    Authors: Felix Wu, Angela Fan, Alexei Baevski, Yann N. Dauphin, Michael Auli

    Abstract: Self-attention is a useful mechanism to build generative models for language and images. It determines the importance of context elements by comparing each element to the current time step. In this paper, we show that a very lightweight convolution can perform competitively to the best reported self-attention results. Next, we introduce dynamic convolutions which are simpler and more efficient tha…

    Submitted 22 February, 2019; v1 submitted 29 January, 2019; originally announced January 2019.

    Comments: 14 pages, ICLR oral

  49. Modeling Human Motion with Quaternion-based Neural Networks

    Authors: Dario Pavllo, Christoph Feichtenhofer, Michael Auli, David Grangier

    Abstract: Previous work on predicting or generating 3D human pose sequences regresses either joint rotations or joint positions. The former strategy is prone to error accumulation along the kinematic chain, as well as discontinuities when using Euler angles or exponential maps as parameterizations. The latter requires re-projection onto skeleton constraints to avoid bone stretching and invalid configuration…

    Submitted 26 October, 2019; v1 submitted 21 January, 2019; originally announced January 2019.

    Comments: Follow-up work of arXiv:1805.06485. This is a pre-print of an article published in IJCV. The final authenticated version is available online at https://doi.org/10.1007/s11263-019-01245-6

    Journal ref: International Journal of Computer Vision (Special Issue on Machine Vision with Deep Learning), 2019. Online ISSN: 1573-1405

  50. arXiv:1811.11742  [pdf, other]

    cs.CV

    3D human pose estimation in video with temporal convolutions and semi-supervised training

    Authors: Dario Pavllo, Christoph Feichtenhofer, David Grangier, Michael Auli

    Abstract: In this work, we demonstrate that 3D poses in video can be effectively estimated with a fully convolutional model based on dilated temporal convolutions over 2D keypoints. We also introduce back-projection, a simple and effective semi-supervised training method that leverages unlabeled video data. We start with predicted 2D keypoints for unlabeled video, then estimate 3D poses and finally back-pro…

    Submitted 29 March, 2019; v1 submitted 28 November, 2018; originally announced November 2018.

    Comments: CVPR 2019