-
Flashlight: Enabling Innovation in Tools for Machine Learning
Authors:
Jacob Kahn,
Vineel Pratap,
Tatiana Likhomanenko,
Qiantong Xu,
Awni Hannun,
Jeff Cai,
Paden Tomasello,
Ann Lee,
Edouard Grave,
Gilad Avidov,
Benoit Steiner,
Vitaliy Liptchinsky,
Gabriel Synnaeve,
Ronan Collobert
Abstract:
As the computational requirements for machine learning systems and the size and complexity of machine learning frameworks increases, essential framework innovation has become challenging. While computational needs have driven recent compiler, networking, and hardware advancements, utilization of those advancements by machine learning tools is occurring at a slower pace. This is in part due to the…
▽ More
As the computational requirements for machine learning systems and the size and complexity of machine learning frameworks increases, essential framework innovation has become challenging. While computational needs have driven recent compiler, networking, and hardware advancements, utilization of those advancements by machine learning tools is occurring at a slower pace. This is in part due to the difficulties involved in prototy** new computational paradigms with existing frameworks. Large frameworks prioritize machine learning researchers and practitioners as end users and pay comparatively little attention to systems researchers who can push frameworks forward -- we argue that both are equally important stakeholders. We introduce Flashlight, an open-source library built to spur innovation in machine learning tools and systems by prioritizing open, modular, customizable internals and state-of-the-art, research-ready models and training setups across a variety of domains. Flashlight allows systems researchers to rapidly prototype and experiment with novel ideas in machine learning computation and has low overhead, competing with and often outperforming other popular machine learning frameworks. We see Flashlight as a tool enabling research that can benefit widely used libraries downstream and bring machine learning and systems researchers closer together. Flashlight is available at https://github.com/flashlight/flashlight .
△ Less
Submitted 22 June, 2022; v1 submitted 28 January, 2022;
originally announced January 2022.
-
Self-supervised Pretraining of Visual Features in the Wild
Authors:
Priya Goyal,
Mathilde Caron,
Benjamin Lefaudeux,
Min Xu,
Pengchao Wang,
Vivek Pai,
Mannat Singh,
Vitaliy Liptchinsky,
Ishan Misra,
Armand Joulin,
Piotr Bojanowski
Abstract:
Recently, self-supervised learning methods like MoCo, SimCLR, BYOL and SwAV have reduced the gap with supervised methods. These results have been achieved in a control environment, that is the highly curated ImageNet dataset. However, the premise of self-supervised learning is that it can learn from any random image and from any unbounded dataset. In this work, we explore if self-supervision lives…
▽ More
Recently, self-supervised learning methods like MoCo, SimCLR, BYOL and SwAV have reduced the gap with supervised methods. These results have been achieved in a control environment, that is the highly curated ImageNet dataset. However, the premise of self-supervised learning is that it can learn from any random image and from any unbounded dataset. In this work, we explore if self-supervision lives to its expectation by training large models on random, uncurated images with no supervision. Our final SElf-supERvised (SEER) model, a RegNetY with 1.3B parameters trained on 1B random images with 512 GPUs achieves 84.2% top-1 accuracy, surpassing the best self-supervised pretrained model by 1% and confirming that self-supervised learning works in a real world setting. Interestingly, we also observe that self-supervised models are good few-shot learners achieving 77.9% top-1 with access to only 10% of ImageNet. Code: https://github.com/facebookresearch/vissl
△ Less
Submitted 5 March, 2021; v1 submitted 2 March, 2021;
originally announced March 2021.
-
Beyond English-Centric Multilingual Machine Translation
Authors:
Angela Fan,
Shruti Bhosale,
Holger Schwenk,
Zhiyi Ma,
Ahmed El-Kishky,
Siddharth Goyal,
Mandeep Baines,
Onur Celebi,
Guillaume Wenzek,
Vishrav Chaudhary,
Naman Goyal,
Tom Birch,
Vitaliy Liptchinsky,
Sergey Edunov,
Edouard Grave,
Michael Auli,
Armand Joulin
Abstract:
Existing work in translation demonstrated the potential of massively multilingual machine translation by training a single model able to translate between any pair of languages. However, much of this work is English-Centric by training only on data which was translated from or to English. While this is supported by large sources of training data, it does not reflect translation needs worldwide. In…
▽ More
Existing work in translation demonstrated the potential of massively multilingual machine translation by training a single model able to translate between any pair of languages. However, much of this work is English-Centric by training only on data which was translated from or to English. While this is supported by large sources of training data, it does not reflect translation needs worldwide. In this work, we create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages. We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining. Then, we explore how to effectively increase model capacity through a combination of dense scaling and language-specific sparse parameters to create high quality models. Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions while performing competitively to the best single systems of WMT. We open-source our scripts so that others may reproduce the data, evaluation, and final M2M-100 model.
△ Less
Submitted 21 October, 2020;
originally announced October 2020.
-
Massively Multilingual ASR: 50 Languages, 1 Model, 1 Billion Parameters
Authors:
Vineel Pratap,
Anuroop Sriram,
Paden Tomasello,
Awni Hannun,
Vitaliy Liptchinsky,
Gabriel Synnaeve,
Ronan Collobert
Abstract:
We study training a single acoustic model for multiple languages with the aim of improving automatic speech recognition (ASR) performance on low-resource languages, and over-all simplifying deployment of ASR systems that support diverse languages. We perform an extensive benchmark on 51 languages, with varying amount of training data by language(from 100 hours to 1100 hours). We compare three vari…
▽ More
We study training a single acoustic model for multiple languages with the aim of improving automatic speech recognition (ASR) performance on low-resource languages, and over-all simplifying deployment of ASR systems that support diverse languages. We perform an extensive benchmark on 51 languages, with varying amount of training data by language(from 100 hours to 1100 hours). We compare three variants of multilingual training from a single joint model without knowing the input language, to using this information, to multiple heads (one per language cluster). We show that multilingual training of ASR models on several languages can improve recognition performance, in particular, on low resource languages. We see 20.9%, 23% and 28.8% average WER relative reduction compared to monolingual baselines on joint model, joint model with language input and multi head model respectively. To our knowledge, this is the first work studying multilingual ASR at massive scale, with more than 50 languages and more than 16,000 hours of audio across them.
△ Less
Submitted 7 July, 2020; v1 submitted 6 July, 2020;
originally announced July 2020.
-
Large scale weakly and semi-supervised learning for low-resource video ASR
Authors:
Kritika Singh,
Vimal Manohar,
Alex Xiao,
Sergey Edunov,
Ross Girshick,
Vitaliy Liptchinsky,
Christian Fuegen,
Yatharth Saraf,
Geoffrey Zweig,
Abdelrahman Mohamed
Abstract:
Many semi- and weakly-supervised approaches have been investigated for overcoming the labeling cost of building high quality speech recognition systems. On the challenging task of transcribing social media videos in low-resource conditions, we conduct a large scale systematic comparison between two self-labeling methods on one hand, and weakly-supervised pretraining using contextual metadata on th…
▽ More
Many semi- and weakly-supervised approaches have been investigated for overcoming the labeling cost of building high quality speech recognition systems. On the challenging task of transcribing social media videos in low-resource conditions, we conduct a large scale systematic comparison between two self-labeling methods on one hand, and weakly-supervised pretraining using contextual metadata on the other. We investigate distillation methods at the frame level and the sequence level for hybrid, encoder-only CTC-based, and encoder-decoder speech recognition systems on Dutch and Romanian languages using 27,000 and 58,000 hours of unlabeled audio respectively. Although all approaches improved upon their respective baseline WERs by more than 8%, sequence-level distillation for encoder-decoder models provided the largest relative WER reduction of 20% compared to the strongest data-augmented supervised baseline.
△ Less
Submitted 6 August, 2020; v1 submitted 15 May, 2020;
originally announced May 2020.
-
Scaling Up Online Speech Recognition Using ConvNets
Authors:
Vineel Pratap,
Qiantong Xu,
Jacob Kahn,
Gilad Avidov,
Tatiana Likhomanenko,
Awni Hannun,
Vitaliy Liptchinsky,
Gabriel Synnaeve,
Ronan Collobert
Abstract:
We design an online end-to-end speech recognition system based on Time-Depth Separable (TDS) convolutions and Connectionist Temporal Classification (CTC). We improve the core TDS architecture in order to limit the future context and hence reduce latency while maintaining accuracy. The system has almost three times the throughput of a well tuned hybrid ASR baseline while also having lower latency a…
▽ More
We design an online end-to-end speech recognition system based on Time-Depth Separable (TDS) convolutions and Connectionist Temporal Classification (CTC). We improve the core TDS architecture in order to limit the future context and hence reduce latency while maintaining accuracy. The system has almost three times the throughput of a well tuned hybrid ASR baseline while also having lower latency and a better word error rate. Also important to the efficiency of the recognizer is our highly optimized beam search decoder. To show the impact of our design choices, we analyze throughput, latency, accuracy, and discuss how these metrics can be tuned based on the user requirements.
△ Less
Submitted 27 January, 2020;
originally announced January 2020.
-
Libri-Light: A Benchmark for ASR with Limited or No Supervision
Authors:
Jacob Kahn,
Morgane Rivière,
Weiyi Zheng,
Evgeny Kharitonov,
Qiantong Xu,
Pierre-Emmanuel Mazaré,
Julien Karadayi,
Vitaliy Liptchinsky,
Ronan Collobert,
Christian Fuegen,
Tatiana Likhomanenko,
Gabriel Synnaeve,
Armand Joulin,
Abdelrahman Mohamed,
Emmanuel Dupoux
Abstract:
We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR…
▽ More
We introduce a new collection of spoken English audio suitable for training speech recognition systems under limited or no supervision. It is derived from open-source audio books from the LibriVox project. It contains over 60K hours of audio, which is, to our knowledge, the largest freely-available corpus of speech. The audio has been segmented using voice activity detection and is tagged with SNR, speaker ID and genre descriptions. Additionally, we provide baseline systems and evaluation metrics working under three settings: (1) the zero resource/unsupervised setting (ABX), (2) the semi-supervised setting (PER, CER) and (3) the distant supervision setting (WER). Settings (2) and (3) use limited textual resources (10 minutes to 10 hours) aligned with the speech. Setting (3) uses large amounts of unaligned text. They are evaluated on the standard LibriSpeech dev and test sets for comparison with the supervised state-of-the-art.
△ Less
Submitted 17 December, 2019;
originally announced December 2019.
-
End-to-end ASR: from Supervised to Semi-Supervised Learning with Modern Architectures
Authors:
Gabriel Synnaeve,
Qiantong Xu,
Jacob Kahn,
Tatiana Likhomanenko,
Edouard Grave,
Vineel Pratap,
Anuroop Sriram,
Vitaliy Liptchinsky,
Ronan Collobert
Abstract:
We study pseudo-labeling for the semi-supervised training of ResNet, Time-Depth Separable ConvNets, and Transformers for speech recognition, with either CTC or Seq2Seq loss functions. We perform experiments on the standard LibriSpeech dataset, and leverage additional unlabeled data from LibriVox through pseudo-labeling. We show that while Transformer-based acoustic models have superior performance…
▽ More
We study pseudo-labeling for the semi-supervised training of ResNet, Time-Depth Separable ConvNets, and Transformers for speech recognition, with either CTC or Seq2Seq loss functions. We perform experiments on the standard LibriSpeech dataset, and leverage additional unlabeled data from LibriVox through pseudo-labeling. We show that while Transformer-based acoustic models have superior performance with the supervised dataset alone, semi-supervision improves all models across architectures and loss functions and bridges much of the performance gaps between them. In doing so, we reach a new state-of-the-art for end-to-end acoustic models decoded with an external language model in the standard supervised learning setting, and a new absolute state-of-the-art with semi-supervised training. Finally, we study the effect of leveraging different amounts of unlabeled audio, propose several ways of evaluating the characteristics of unlabeled audio which improve acoustic modeling, and show that acoustic models trained with more audio rely less on external language models.
△ Less
Submitted 14 July, 2020; v1 submitted 19 November, 2019;
originally announced November 2019.
-
wav2letter++: The Fastest Open-source Speech Recognition System
Authors:
Vineel Pratap,
Awni Hannun,
Qiantong Xu,
Jeff Cai,
Jacob Kahn,
Gabriel Synnaeve,
Vitaliy Liptchinsky,
Ronan Collobert
Abstract:
This paper introduces wav2letter++, the fastest open-source deep learning speech recognition framework. wav2letter++ is written entirely in C++, and uses the ArrayFire tensor library for maximum efficiency. Here we explain the architecture and design of the wav2letter++ system and compare it to other major open-source speech recognition systems. In some cases wav2letter++ is more than 2x faster th…
▽ More
This paper introduces wav2letter++, the fastest open-source deep learning speech recognition framework. wav2letter++ is written entirely in C++, and uses the ArrayFire tensor library for maximum efficiency. Here we explain the architecture and design of the wav2letter++ system and compare it to other major open-source speech recognition systems. In some cases wav2letter++ is more than 2x faster than other optimized frameworks for training end-to-end neural networks for speech recognition. We also show that wav2letter++'s training times scale linearly to 64 GPUs, the highest we tested, for models with 100 million parameters. High-performance frameworks enable fast iteration, which is often a crucial factor in successful research and model tuning on new datasets and tasks.
△ Less
Submitted 18 December, 2018;
originally announced December 2018.
-
Fully Convolutional Speech Recognition
Authors:
Neil Zeghidour,
Qiantong Xu,
Vitaliy Liptchinsky,
Nicolas Usunier,
Gabriel Synnaeve,
Ronan Collobert
Abstract:
Current state-of-the-art speech recognition systems build on recurrent neural networks for acoustic and/or language modeling, and rely on feature extraction pipelines to extract mel-filterbanks or cepstral coefficients. In this paper we present an alternative approach based solely on convolutional neural networks, leveraging recent advances in acoustic models from the raw waveform and language mod…
▽ More
Current state-of-the-art speech recognition systems build on recurrent neural networks for acoustic and/or language modeling, and rely on feature extraction pipelines to extract mel-filterbanks or cepstral coefficients. In this paper we present an alternative approach based solely on convolutional neural networks, leveraging recent advances in acoustic models from the raw waveform and language modeling. This fully convolutional approach is trained end-to-end to predict characters from the raw waveform, removing the feature extraction step altogether. An external convolutional language model is used to decode words. On Wall Street Journal, our model matches the current state-of-the-art. On Librispeech, we report state-of-the-art performance among end-to-end models, including Deep Speech 2 trained with 12 times more acoustic data and significantly more linguistic data.
△ Less
Submitted 9 April, 2019; v1 submitted 17 December, 2018;
originally announced December 2018.
-
To Reverse the Gradient or Not: An Empirical Comparison of Adversarial and Multi-task Learning in Speech Recognition
Authors:
Yossi Adi,
Neil Zeghidour,
Ronan Collobert,
Nicolas Usunier,
Vitaliy Liptchinsky,
Gabriel Synnaeve
Abstract:
Transcribed datasets typically contain speaker identity for each instance in the data. We investigate two ways to incorporate this information during training: Multi-Task Learning and Adversarial Learning. In multi-task learning, the goal is speaker prediction; we expect a performance improvement with this joint training if the two tasks of speech recognition and speaker recognition share a common…
▽ More
Transcribed datasets typically contain speaker identity for each instance in the data. We investigate two ways to incorporate this information during training: Multi-Task Learning and Adversarial Learning. In multi-task learning, the goal is speaker prediction; we expect a performance improvement with this joint training if the two tasks of speech recognition and speaker recognition share a common set of underlying features. In contrast, adversarial learning is a means to learn representations invariant to the speaker. We then expect better performance if this learnt invariance helps generalizing to new speakers. While the two approaches seem natural in the context of speech recognition, they are incompatible because they correspond to opposite gradients back-propagated to the model. In order to better understand the effect of these approaches in terms of error rates, we compare both strategies in controlled settings. Moreover, we explore the use of additional untranscribed data in a semi-supervised, adversarial learning manner to improve error rates. Our results show that deep models trained on big datasets already develop invariant representations to speakers without any auxiliary loss. When considering adversarial learning and multi-task learning, the impact on the acoustic model seems minor. However, models trained in a semi-supervised manner can improve error-rates.
△ Less
Submitted 14 February, 2019; v1 submitted 9 December, 2018;
originally announced December 2018.
-
Letter-Based Speech Recognition with Gated ConvNets
Authors:
Vitaliy Liptchinsky,
Gabriel Synnaeve,
Ronan Collobert
Abstract:
In the recent literature, "end-to-end" speech systems often refer to letter-based acoustic models trained in a sequence-to-sequence manner, either via a recurrent model or via a structured output learning approach (such as CTC). In contrast to traditional phone (or senone)-based approaches, these "end-to-end'' approaches alleviate the need of word pronunciation modeling, and do not require a "forc…
▽ More
In the recent literature, "end-to-end" speech systems often refer to letter-based acoustic models trained in a sequence-to-sequence manner, either via a recurrent model or via a structured output learning approach (such as CTC). In contrast to traditional phone (or senone)-based approaches, these "end-to-end'' approaches alleviate the need of word pronunciation modeling, and do not require a "forced alignment" step at training time. Phone-based approaches remain however state of the art on classical benchmarks. In this paper, we propose a letter-based speech recognition system, leveraging a ConvNet acoustic model. Key ingredients of the ConvNet are Gated Linear Units and high dropout. The ConvNet is trained to map audio sequences to their corresponding letter transcriptions, either via a classical CTC approach, or via a recent variant called ASG. Coupled with a simple decoder at inference time, our system matches the best existing letter-based systems on WSJ (in word error rate), and shows near state of the art performance on LibriSpeech.
△ Less
Submitted 15 February, 2019; v1 submitted 22 December, 2017;
originally announced December 2017.