Search | arXiv e-print repository

arXiv:2212.12017 [pdf, other]

OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization

Authors: Srinivasan Iyer, Xi Victoria Lin, Ramakanth Pasunuru, Todor Mihaylov, Daniel Simig, ** Yu, Kurt Shuster, Tianlu Wang, Qing Liu, Punit Singh Koura, Xian Li, Brian O'Horo, Gabriel Pereyra, Jeff Wang, Christopher Dewan, Asli Celikyilmaz, Luke Zettlemoyer, Ves Stoyanov

Abstract: Recent work has shown that fine-tuning large pre-trained language models on a collection of tasks described via instructions, a.k.a. instruction-tuning, improves their zero and few-shot generalization to unseen tasks. However, there is a limited understanding of the performance trade-offs of different decisions made during the instruction-tuning process. These decisions include the scale and diversity of the instruction-tuning benchmark, different task sampling strategies, fine-tuning with and without demonstrations, training using specialized datasets for reasoning and dialogue, and finally, the fine-tuning objectives themselves. In this paper, we characterize the effect of instruction-tuning decisions on downstream task performance when scaling both model and benchmark sizes. To this end, we create OPT-IML Bench: a large benchmark for Instruction Meta-Learning (IML) of 2000 NLP tasks consolidated into task categories from 8 existing benchmarks, and prepare an evaluation framework to measure three types of model generalizations: to tasks from fully held-out categories, to held-out tasks from seen categories, and to held-out instances from seen tasks. Through the lens of this framework, we first present insights about instruction-tuning decisions as applied to OPT-30B and further exploit these insights to train OPT-IML 30B and 175B, which are instruction-tuned versions of OPT. OPT-IML demonstrates all three generalization abilities at both scales on four different evaluation benchmarks with diverse tasks and input formats -- PromptSource, FLAN, Super-NaturalInstructions, and UnifiedSKG. Not only does it significantly outperform OPT on all benchmarks but is also highly competitive with existing models fine-tuned on each specific benchmark. We release OPT-IML at both scales, together with the OPT-IML Bench evaluation framework.

Submitted 30 January, 2023; v1 submitted 22 December, 2022; originally announced December 2022.

Comments: 56 pages. v2->v3: fix OPT-30B evaluation results across benchmarks (previously we reported lower performance of this model due to an evaluation pipeline bug)

arXiv:1804.03235 [pdf, other]

Large scale distributed neural network training through online distillation

Authors: Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E. Dahl, Geoffrey E. Hinton

Abstract: Techniques such as ensembling and distillation promise model quality improvements when paired with almost any base model. However, due to increased test-time cost (for ensembles) and increased complexity of the training pipeline (for distillation), these techniques are challenging to use in industrial settings. In this paper we explore a variant of distillation which is relatively straightforward to use as it does not require a complicated multi-stage setup or many new hyperparameters. Our first claim is that online distillation enables us to use extra parallelism to fit very large datasets about twice as fast. Crucially, we can still speed up training even after we have already reached the point at which additional parallelism provides no benefit for synchronous or asynchronous stochastic gradient descent. Two neural networks trained on disjoint subsets of the data can share knowledge by encouraging each model to agree with the predictions the other model would have made. These predictions can come from a stale version of the other model so they can be safely computed using weights that only rarely get transmitted. Our second claim is that online distillation is a cost-effective way to make the exact predictions of a model dramatically more reproducible. We support our claims using experiments on the Criteo Display Ad Challenge dataset, ImageNet, and the largest to-date dataset used for neural language modeling, containing $6\times 10^{11}$ tokens and based on the Common Crawl repository of web data.

Submitted 20 August, 2020; v1 submitted 9 April, 2018; originally announced April 2018.

Comments: Clarify that implementations should use available parallelism in pseudo-code

arXiv:1706.03859 [pdf]

Size invariance sector for an agent-based innovation diffusion model

Authors: Carlos E. Laciana, Gustavo Pereyra, Santiago L. Rovere

Abstract: It is shown that under certain conditions it is possible to model a complex system in a way that leads to results that do not depend on system size. As an example of complex system an innovation diffusion model is considered. In that model a set of individuals (the agents), which are interconnected, must decide if adopt or not an innovation. The agents are connected in a member of the networks family known as small worlds networks (SWN). It is found that for a subfamily of the SWN the saturation time and the form of the adoption curve are invariants respect to the change in the size of the system.

Submitted 27 July, 2017; v1 submitted 12 June, 2017; originally announced June 2017.

Comments: 10 pages

arXiv:1701.06548 [pdf, other]

Regularizing Neural Networks by Penalizing Confident Output Distributions

Authors: Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, Geoffrey Hinton

Abstract: We systematically explore regularizing neural networks by penalizing low entropy output distributions. We show that penalizing low entropy output distributions, which has been shown to improve exploration in reinforcement learning, acts as a strong regularizer in supervised learning. Furthermore, we connect a maximum entropy based confidence penalty to label smoothing through the direction of the KL divergence. We exhaustively evaluate the proposed confidence penalty and label smoothing on 6 common benchmarks: image classification (MNIST and Cifar-10), language modeling (Penn Treebank), machine translation (WMT'14 English-to-German), and speech recognition (TIMIT and WSJ). We find that both label smoothing and the confidence penalty improve state-of-the-art models across benchmarks without modifying existing hyperparameters, suggesting the wide applicability of these regularizers.

Submitted 23 January, 2017; originally announced January 2017.

Comments: Submitted to ICLR 2017

arXiv:1510.01378 [pdf, other]

Batch Normalized Recurrent Neural Networks

Authors: César Laurent, Gabriel Pereyra, Philémon Brakel, Ying Zhang, Yoshua Bengio

Abstract: Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. In this paper, we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure. We also show that when applied to the input-to-hidden transitions, batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. All in all, applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.

Submitted 5 October, 2015; originally announced October 2015.

arXiv:1307.3611 [pdf, other]

doi 10.1016/j.cplett.2012.04.022

On the relation between hydrogen bonds, tetrahedral order and molecular mobility in model water

Authors: R. G. Pereyra, A. Bermudez di Lorenzo, D. C. Malaspina, M. A. Carignano

Abstract: We studied by molecular dynamics simulations the relation existing between the lifetime of hydrogen bonds, the tetrahedral order and the diffusion coefficient of model water. We tested four different models: SPC/E, TIP4P-Ew, TIP5P-Ew and Six-site, these last two having sites explicitly resembling the water lone pairs. While all the models perform reasonably well at ambient conditions, their behavior is significantly different for temperatures below 270 K. The models with explicit lone-pairs have a longer hydrogen bond lifetime, a better tetrahedral order and a smaller diffusion coefficient than the models without them.

Submitted 13 July, 2013; originally announced July 2013.

Comments: 13 pages, 5 figures

Journal ref: Chemical Physics Letters, v. 538, pp. 35-38 (2012)

arXiv:1307.3405 [pdf, other]

doi 10.1063/1.4812928

The water supercooled regime as described by four common water models

Authors: David C. Malaspina, Aleida J. Bermudez di Lorenzo, Rodolfo G. Pereyra, Igal Szleifer, Marcelo A. Carignano

Abstract: The temperature scale of simple water models in general does not coincide with the natural one. Therefore, in order to make a meaningful evaluation of different water models a temperature rescaling is necessary. In this paper we introduce a rescaling using the melting temperature and the temperature corresponding to the maximum of the heat capacity to evaluate four common water models (TIP4P-Ew, TIP4P-2005, TIP5P-Ew and Six-Sites) in the supercooled regime. Although all the models show the same general qualitative behavior, the TIP5P-Ew appears as the best representation of the supercooled regime when the rescaled temperature is used. We also analyze, using thermodynamic arguments, the critical nucleus size for ice growth. Finally, we speculate on the possible reasons why atomistic models do not usually crystalize while the coarse grained mW model do crystallize.

Submitted 12 July, 2013; originally announced July 2013.

Comments: 8 pages, 8 figures

Showing 1–7 of 7 results for author: Pereyra, G