-
OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization
Authors:
Srinivasan Iyer,
Xi Victoria Lin,
Ramakanth Pasunuru,
Todor Mihaylov,
Daniel Simig,
Ping Yu,
Kurt Shuster,
Tianlu Wang,
Qing Liu,
Punit Singh Koura,
Xian Li,
Brian O'Horo,
Gabriel Pereyra,
Jeff Wang,
Christopher Dewan,
Asli Celikyilmaz,
Luke Zettlemoyer,
Ves Stoyanov
Abstract:
Recent work has shown that fine-tuning large pre-trained language models on a collection of tasks described via instructions, a.k.a. instruction-tuning, improves their zero and few-shot generalization to unseen tasks. However, there is a limited understanding of the performance trade-offs of different decisions made during the instruction-tuning process. These decisions include the scale and diversity of the instruction-tuning benchmark, different task sampling strategies, fine-tuning with and without demonstrations, training using specialized datasets for reasoning and dialogue, and finally, the fine-tuning objectives themselves. In this paper, we characterize the effect of instruction-tuning decisions on downstream task performance when scaling both model and benchmark sizes. To this end, we create OPT-IML Bench: a large benchmark for Instruction Meta-Learning (IML) of 2000 NLP tasks consolidated into task categories from 8 existing benchmarks, and prepare an evaluation framework to measure three types of model generalizations: to tasks from fully held-out categories, to held-out tasks from seen categories, and to held-out instances from seen tasks. Through the lens of this framework, we first present insights about instruction-tuning decisions as applied to OPT-30B and further exploit these insights to train OPT-IML 30B and 175B, which are instruction-tuned versions of OPT. OPT-IML demonstrates all three generalization abilities at both scales on four different evaluation benchmarks with diverse tasks and input formats -- PromptSource, FLAN, Super-NaturalInstructions, and UnifiedSKG. Not only does it significantly outperform OPT on all benchmarks but is also highly competitive with existing models fine-tuned on each specific benchmark. We release OPT-IML at both scales, together with the OPT-IML Bench evaluation framework.
Submitted 30 January, 2023; v1 submitted 22 December, 2022;
originally announced December 2022.
-
Large scale distributed neural network training through online distillation
Authors:
Rohan Anil,
Gabriel Pereyra,
Alexandre Passos,
Robert Ormandi,
George E. Dahl,
Geoffrey E. Hinton
Abstract:
Techniques such as ensembling and distillation promise model quality improvements when paired with almost any base model. However, due to increased test-time cost (for ensembles) and increased complexity of the training pipeline (for distillation), these techniques are challenging to use in industrial settings. In this paper we explore a variant of distillation which is relatively straightforward to use as it does not require a complicated multi-stage setup or many new hyperparameters. Our first claim is that online distillation enables us to use extra parallelism to fit very large datasets about twice as fast. Crucially, we can still speed up training even after we have already reached the point at which additional parallelism provides no benefit for synchronous or asynchronous stochastic gradient descent. Two neural networks trained on disjoint subsets of the data can share knowledge by encouraging each model to agree with the predictions the other model would have made. These predictions can come from a stale version of the other model so they can be safely computed using weights that only rarely get transmitted. Our second claim is that online distillation is a cost-effective way to make the exact predictions of a model dramatically more reproducible. We support our claims using experiments on the Criteo Display Ad Challenge dataset, ImageNet, and the largest to-date dataset used for neural language modeling, containing $6\times 10^{11}$ tokens and based on the Common Crawl repository of web data.
Submitted 20 August, 2020; v1 submitted 9 April, 2018;
originally announced April 2018.
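The agreement idea in the online distillation abstract above (each model is encouraged to match the predictions of a possibly stale copy of its peer) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `distill_weight` parameter, and the choice of a cross-entropy agreement term are assumptions made for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, pred_probs, eps=1e-12):
    """Cross-entropy of predictions against a (possibly soft) target distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, pred_probs))


def codistillation_loss(logits, label, stale_peer_logits, distill_weight=0.5):
    """Task loss plus an agreement term against a stale peer's predictions.

    `distill_weight` is a hypothetical hyperparameter, not a value from the paper.
    The peer's predictions are treated as a fixed target (no gradient would flow
    into the peer), which is why they can come from rarely transmitted weights.
    """
    probs = softmax(logits)
    one_hot = [1.0 if i == label else 0.0 for i in range(len(logits))]
    task_loss = cross_entropy(one_hot, probs)
    peer_probs = softmax(stale_peer_logits)
    agreement_loss = cross_entropy(peer_probs, probs)
    return task_loss + distill_weight * agreement_loss
```

With `distill_weight=0` the loss reduces to ordinary cross-entropy on the label, and the agreement term grows as the model's prediction moves away from the peer's.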
-
Size invariance sector for an agent-based innovation diffusion model
Authors:
Carlos E. Laciana,
Gustavo Pereyra,
Santiago L. Rovere
Abstract:
It is shown that under certain conditions it is possible to model a complex system in a way that leads to results that do not depend on system size. As an example of a complex system, an innovation diffusion model is considered. In that model, a set of interconnected individuals (the agents) must decide whether or not to adopt an innovation. The agents are connected in a member of the family of networks known as small-world networks (SWN). It is found that for a subfamily of the SWN, the saturation time and the form of the adoption curve are invariant with respect to changes in the size of the system.
Submitted 27 July, 2017; v1 submitted 12 June, 2017;
originally announced June 2017.
-
Regularizing Neural Networks by Penalizing Confident Output Distributions
Authors:
Gabriel Pereyra,
George Tucker,
Jan Chorowski,
Łukasz Kaiser,
Geoffrey Hinton
Abstract:
We systematically explore regularizing neural networks by penalizing low entropy output distributions. We show that penalizing low entropy output distributions, which has been shown to improve exploration in reinforcement learning, acts as a strong regularizer in supervised learning. Furthermore, we connect a maximum entropy based confidence penalty to label smoothing through the direction of the KL divergence. We exhaustively evaluate the proposed confidence penalty and label smoothing on 6 common benchmarks: image classification (MNIST and Cifar-10), language modeling (Penn Treebank), machine translation (WMT'14 English-to-German), and speech recognition (TIMIT and WSJ). We find that both label smoothing and the confidence penalty improve state-of-the-art models across benchmarks without modifying existing hyperparameters, suggesting the wide applicability of these regularizers.
Submitted 23 January, 2017;
originally announced January 2017.
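The two regularizers named in the abstract above can be written down compactly. The sketch below assumes the common formulation in which the confidence penalty subtracts a weighted entropy of the output distribution from the negative log-likelihood, and label smoothing mixes the one-hot target with a uniform distribution. The function names and the default values of `beta` and `smoothing` are illustrative assumptions, not settings taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs, eps=1e-12):
    """Shannon entropy of a discrete distribution, in nats."""
    return -sum(p * math.log(p + eps) for p in probs)


def nll(probs, label, eps=1e-12):
    """Negative log-likelihood of the correct class."""
    return -math.log(probs[label] + eps)


def confidence_penalty_loss(logits, label, beta=0.1):
    """NLL minus beta times the output entropy, so low-entropy outputs cost more."""
    probs = softmax(logits)
    return nll(probs, label) - beta * entropy(probs)


def label_smoothing_loss(logits, label, smoothing=0.1, eps=1e-12):
    """Cross-entropy against a one-hot target mixed with the uniform distribution."""
    probs = softmax(logits)
    k = len(probs)
    targets = [
        (1.0 - smoothing) * (1.0 if i == label else 0.0) + smoothing / k
        for i in range(k)
    ]
    return -sum(t * math.log(p + eps) for t, p in zip(targets, probs))
```

Setting `beta=0` or `smoothing=0` recovers the plain negative log-likelihood in both cases, which makes the two functions easy to compare against an unregularized baseline.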
-
Batch Normalized Recurrent Neural Networks
Authors:
César Laurent,
Gabriel Pereyra,
Philémon Brakel,
Ying Zhang,
Yoshua Bengio
Abstract:
Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on either our language modelling or speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.
Submitted 5 October, 2015;
originally announced October 2015.
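The input-to-hidden variant mentioned in the abstract above can be sketched as a single recurrent step in which only the input projection is standardized across the mini-batch, while the hidden-to-hidden term is left untouched. This is a simplified illustration, not the paper's code: the learned scale and shift of batch normalization are omitted, and the names `rnn_step`, `W_x`, and `W_h` are assumptions.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Standardize each feature across the mini-batch (no learned scale or shift)."""
    n, dims = len(batch), len(batch[0])
    out = [[0.0] * dims for _ in range(n)]
    for j in range(dims):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        for i in range(n):
            out[i][j] = (col[i] - mean) / math.sqrt(var + eps)
    return out


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rnn_step(x_batch, h_batch, W_x, W_h):
    """One tanh RNN step with batch normalization on the input-to-hidden term only."""
    in_proj = batch_norm([matvec(W_x, x) for x in x_batch])
    return [
        [math.tanh(a + b) for a, b in zip(proj, matvec(W_h, h))]
        for proj, h in zip(in_proj, h_batch)
    ]
```

Calling `rnn_step` once per time step, feeding each output back in as `h_batch`, gives a full forward pass over a sequence.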
-
On the relation between hydrogen bonds, tetrahedral order and molecular mobility in model water
Authors:
R. G. Pereyra,
A. Bermudez di Lorenzo,
D. C. Malaspina,
M. A. Carignano
Abstract:
Using molecular dynamics simulations, we studied the relation between the lifetime of hydrogen bonds, the tetrahedral order and the diffusion coefficient of model water. We tested four different models: SPC/E, TIP4P-Ew, TIP5P-Ew and Six-site, the last two having sites that explicitly resemble the water lone pairs. While all the models perform reasonably well at ambient conditions, their behavior is significantly different for temperatures below 270 K. The models with explicit lone pairs have a longer hydrogen bond lifetime, a better tetrahedral order and a smaller diffusion coefficient than the models without them.
Submitted 13 July, 2013;
originally announced July 2013.
-
The water supercooled regime as described by four common water models
Authors:
David C. Malaspina,
Aleida J. Bermudez di Lorenzo,
Rodolfo G. Pereyra,
Igal Szleifer,
Marcelo A. Carignano
Abstract:
The temperature scale of simple water models in general does not coincide with the natural one. Therefore, in order to make a meaningful evaluation of different water models, a temperature rescaling is necessary. In this paper we introduce a rescaling using the melting temperature and the temperature corresponding to the maximum of the heat capacity to evaluate four common water models (TIP4P-Ew, TIP4P-2005, TIP5P-Ew and Six-Sites) in the supercooled regime. Although all the models show the same general qualitative behavior, TIP5P-Ew appears to be the best representation of the supercooled regime when the rescaled temperature is used. We also analyze, using thermodynamic arguments, the critical nucleus size for ice growth. Finally, we speculate on the possible reasons why atomistic models do not usually crystallize while the coarse-grained mW model does.
Submitted 12 July, 2013;
originally announced July 2013.