-
Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models
Authors:
Sheng Shen,
Le Hou,
Yanqi Zhou,
Nan Du,
Shayne Longpre,
Jason Wei,
Hyung Won Chung,
Barret Zoph,
William Fedus,
Xinyun Chen,
Tu Vu,
Yuexin Wu,
Wuyang Chen,
Albert Webson,
Yunxuan Li,
Vincent Zhao,
Hongkun Yu,
Kurt Keutzer,
Trevor Darrell,
Denny Zhou
Abstract:
Sparse Mixture-of-Experts (MoE) is a neural architecture design that can be utilized to add learnable parameters to Large Language Models (LLMs) without increasing inference cost. Instruction tuning is a technique for training LLMs to follow instructions. We advocate combining these two approaches, as we find that MoE models benefit more from instruction tuning than dense models. In particular, we conduct empirical studies across three experimental setups: (i) Direct finetuning on individual downstream tasks devoid of instruction tuning; (ii) Instruction tuning followed by in-context few-shot or zero-shot generalization on downstream tasks; and (iii) Instruction tuning supplemented by further finetuning on individual downstream tasks. In the first scenario, MoE models overall underperform dense models of identical computational capacity. This narrative, however, dramatically changes with the introduction of instruction tuning (second and third scenario), used independently or in conjunction with task-specific finetuning. Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks, while using only a third of the FLOPs. The advancements embodied by FLAN-MOE inspire a reevaluation of the design principles of large-scale, high-performance language models in the framework of task-agnostic learning.
Submitted 5 July, 2023; v1 submitted 24 May, 2023;
originally announced May 2023.
-
Scaling Instruction-Finetuned Language Models
Authors:
Hyung Won Chung,
Le Hou,
Shayne Longpre,
Barret Zoph,
Yi Tay,
William Fedus,
Yunxuan Li,
Xuezhi Wang,
Mostafa Dehghani,
Siddhartha Brahma,
Albert Webson,
Shixiang Shane Gu,
Zhuyun Dai,
Mirac Suzgun,
Xinyun Chen,
Aakanksha Chowdhery,
Alex Castro-Ros,
Marie Pellat,
Kevin Robinson,
Dasha Valter,
Sharan Narang,
Gaurav Mishra,
Adams Yu,
Vincent Zhao,
Yanping Huang
, et al. (10 additional authors not shown)
Abstract:
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
Submitted 6 December, 2022; v1 submitted 20 October, 2022;
originally announced October 2022.
-
A Review of Sparse Expert Models in Deep Learning
Authors:
William Fedus,
Jeff Dean,
Barret Zoph
Abstract:
Sparse expert models are a thirty-year-old concept re-emerging as a popular architecture in deep learning. This class of architecture encompasses Mixture-of-Experts, Switch Transformers, Routing Networks, BASE layers, and others, all with the unifying idea that each example is acted on by a subset of the parameters. By doing so, the degree of sparsity decouples the parameter count from the compute per example, allowing for extremely large but efficient models. The resulting models have demonstrated significant improvements across diverse domains such as natural language processing, computer vision, and speech recognition. We review the concept of sparse expert models, provide a basic description of the common algorithms, contextualize the advances in the deep learning era, and conclude by highlighting areas for future work.
Submitted 4 September, 2022;
originally announced September 2022.
-
Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?
Authors:
Yi Tay,
Mostafa Dehghani,
Samira Abnar,
Hyung Won Chung,
William Fedus,
Jinfeng Rao,
Sharan Narang,
Vinh Q. Tran,
Dani Yogatama,
Donald Metzler
Abstract:
There has been a lot of interest in the scaling properties of Transformer models. However, not much has been done on the front of investigating the effect of scaling properties of different inductive biases and model architectures. Do model architectures scale differently? If so, how does inductive bias affect scaling behaviour? How does this influence upstream (pretraining) and downstream (transfer)? This paper conducts a systematic study of scaling behaviour of ten diverse model architectures such as Transformers, Switch Transformers, Universal Transformers, Dynamic convolutions, Performers, and recently proposed MLP-Mixers. Via extensive experiments, we show that (1) architecture is indeed an important consideration when performing scaling and (2) the best performing model can fluctuate at different scales. We believe that the findings outlined in this work have significant implications for how model architectures are currently evaluated in the community.
Submitted 21 July, 2022;
originally announced July 2022.
-
Emergent Abilities of Large Language Models
Authors:
Jason Wei,
Yi Tay,
Rishi Bommasani,
Colin Raffel,
Barret Zoph,
Sebastian Borgeaud,
Dani Yogatama,
Maarten Bosma,
Denny Zhou,
Donald Metzler,
Ed H. Chi,
Tatsunori Hashimoto,
Oriol Vinyals,
Percy Liang,
Jeff Dean,
William Fedus
Abstract:
Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence implies that additional scaling could further expand the range of capabilities of language models.
Submitted 26 October, 2022; v1 submitted 15 June, 2022;
originally announced June 2022.
-
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Authors:
Aarohi Srivastava,
Abhinav Rastogi,
Abhishek Rao,
Abu Awal Md Shoeb,
Abubakar Abid,
Adam Fisch,
Adam R. Brown,
Adam Santoro,
Aditya Gupta,
Adrià Garriga-Alonso,
Agnieszka Kluska,
Aitor Lewkowycz,
Akshat Agarwal,
Alethea Power,
Alex Ray,
Alex Warstadt,
Alexander W. Kocurek,
Ali Safaya,
Ali Tazarv,
Alice Xiang,
Alicia Parrish,
Allen Nie,
Aman Hussain,
Amanda Askell,
Amanda Dsouza
, et al. (426 additional authors not shown)
Abstract:
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
Submitted 12 June, 2023; v1 submitted 9 June, 2022;
originally announced June 2022.
-
ST-MoE: Designing Stable and Transferable Sparse Expert Models
Authors:
Barret Zoph,
Irwan Bello,
Sameer Kumar,
Nan Du,
Yanping Huang,
Jeff Dean,
Noam Shazeer,
William Fedus
Abstract:
Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning. Our work focuses on these issues and acts as a design guide. We conclude by scaling a sparse model to 269B parameters, with a computational cost comparable to a 32B dense encoder-decoder Transformer (Stable and Transferable Mixture-of-Experts or ST-MoE-32B). For the first time, a sparse model achieves state-of-the-art performance in transfer learning, across a diverse set of tasks including reasoning (SuperGLUE, ARC Easy, ARC Challenge), summarization (XSum, CNN-DM), closed book question answering (WebQA, Natural Questions), and adversarially constructed tasks (Winogrande, ANLI R3).
Submitted 29 April, 2022; v1 submitted 17 February, 2022;
originally announced February 2022.
-
On Bonus-Based Exploration Methods in the Arcade Learning Environment
Authors:
Adrien Ali Taïga,
William Fedus,
Marlos C. Machado,
Aaron Courville,
Marc G. Bellemare
Abstract:
Research on exploration in reinforcement learning, as applied to Atari 2600 game-playing, has emphasized tackling difficult exploration problems such as Montezuma's Revenge (Bellemare et al., 2016). Recently, bonus-based exploration methods, which explore by augmenting the environment reward, have reached above-human average performance on such domains. In this paper we reassess popular bonus-based exploration methods within a common evaluation framework. We combine Rainbow (Hessel et al., 2018) with different exploration bonuses and evaluate its performance on Montezuma's Revenge, Bellemare et al.'s set of hard exploration games with sparse rewards, and the whole Atari 2600 suite. We find that while exploration bonuses lead to a higher score on Montezuma's Revenge, they do not provide meaningful gains over the simpler $ε$-greedy scheme. In fact, we find that methods that perform best on that game often underperform $ε$-greedy on easy exploration Atari 2600 games. We find that our conclusions remain valid even when hyperparameters are tuned for these easy-exploration games. Finally, we find that none of the methods surveyed benefit from additional training samples (1 billion frames, versus Rainbow's 200 million) on Bellemare et al.'s hard exploration games. Our results suggest that recent gains in Montezuma's Revenge may be better attributed to architecture change, rather than better exploration schemes; and that the real pace of progress in exploration research for Atari 2600 games may have been obfuscated by good results on a single domain.
Submitted 22 September, 2021;
originally announced September 2021.
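The abstract above uses the $ε$-greedy scheme as its baseline without describing it. As a rough illustration only, the following Python sketch shows the usual textbook form of $ε$-greedy action selection; the function name `epsilon_greedy` and its parameters are hypothetical and are not taken from the paper or from Rainbow's code.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of estimated action values.

    With probability `epsilon` a uniformly random action is returned
    (exploration); otherwise the action with the highest estimated
    value is returned (exploitation). Hypothetical helper for
    illustration; not the paper's implementation.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest value (first one on ties).
    return max(range(len(q_values)), key=q_values.__getitem__)


# Example call with made-up action values and a seeded generator.
action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.05, rng=random.Random(0))
```

Bonus-based methods, by contrast, change the reward the agent learns from, so the two approaches differ in where the exploration pressure is applied.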
-
Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers
Authors:
Yi Tay,
Mostafa Dehghani,
Jinfeng Rao,
William Fedus,
Samira Abnar,
Hyung Won Chung,
Sharan Narang,
Dani Yogatama,
Ashish Vaswani,
Donald Metzler
Abstract:
There remain many open questions pertaining to the scaling behaviour of Transformer architectures. These scaling decisions and findings can be critical, as training runs often come with an associated computational cost which has financial and/or environmental impact. The goal of this paper is to present scaling insights from pretraining and finetuning Transformers. While Kaplan et al. present a comprehensive study of the scaling behaviour of Transformer language models, the scope is only on the upstream (pretraining) loss. Therefore, it is still unclear if this set of findings transfers to downstream tasks within the context of the pretrain-finetune paradigm. The key findings of this paper are as follows: (1) we show that aside from only the model size, model shape matters for downstream fine-tuning, (2) scaling protocols operate differently at different compute regions, (3) widely adopted T5-base and T5-large sizes are Pareto-inefficient. To this end, we present improved scaling protocols whereby our redesigned models achieve similar downstream fine-tuning quality while having 50% fewer parameters and training 40% faster compared to the widely adopted T5-base model. We publicly release over 100 pretrained checkpoints of different T5 configurations to facilitate future research and analysis.
Submitted 30 January, 2022; v1 submitted 22 September, 2021;
originally announced September 2021.
-
Revisiting ResNets: Improved Training and Scaling Strategies
Authors:
Irwan Bello,
William Fedus,
Xianzhi Du,
Ekin D. Cubuk,
Aravind Srinivas,
Tsung-Yi Lin,
Jonathon Shlens,
Barret Zoph
Abstract:
Novel computer vision architectures monopolize the spotlight, but the impact of the model architecture is often conflated with simultaneous changes to training methodology and scaling strategies. Our work revisits the canonical ResNet (He et al., 2015) and studies these three aspects in an effort to disentangle them. Perhaps surprisingly, we find that training and scaling strategies may matter more than architectural changes, and further, that the resulting ResNets match recent state-of-the-art models. We show that the best performing scaling strategy depends on the training regime and offer two new scaling strategies: (1) scale model depth in regimes where overfitting can occur (width scaling is preferable otherwise); (2) increase image resolution more slowly than previously recommended (Tan & Le, 2019). Using improved training and scaling strategies, we design a family of ResNet architectures, ResNet-RS, which are 1.7x - 2.7x faster than EfficientNets on TPUs, while achieving similar accuracies on ImageNet. In a large-scale semi-supervised learning setup, ResNet-RS achieves 86.2% top-1 ImageNet accuracy, while being 4.7x faster than EfficientNet NoisyStudent. The training techniques improve transfer performance on a suite of downstream tasks (rivaling state-of-the-art self-supervised algorithms) and extend to video classification on Kinetics-400. We recommend practitioners use these simple revised ResNets as baselines for future research.
Submitted 12 March, 2021;
originally announced March 2021.
-
Do Transformer Modifications Transfer Across Implementations and Applications?
Authors:
Sharan Narang,
Hyung Won Chung,
Yi Tay,
William Fedus,
Thibault Fevry,
Michael Matena,
Karishma Malkan,
Noah Fiedel,
Noam Shazeer,
Zhenzhong Lan,
Yanqi Zhou,
Wei Li,
Nan Ding,
Jake Marcus,
Adam Roberts,
Colin Raffel
Abstract:
The research community has proposed copious modifications to the Transformer architecture since it was introduced over three years ago, relatively few of which have seen widespread adoption. In this paper, we comprehensively evaluate many of these modifications in a shared experimental setting that covers most of the common uses of the Transformer in natural language processing. Surprisingly, we find that most modifications do not meaningfully improve performance. Furthermore, most of the Transformer variants we found beneficial were either developed in the same codebase that we used or are relatively minor changes. We conjecture that performance improvements may strongly depend on implementation details and correspondingly make some recommendations for improving the generality of experimental results.
Submitted 10 September, 2021; v1 submitted 23 February, 2021;
originally announced February 2021.
-
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Authors:
William Fedus,
Barret Zoph,
Noam Shazeer
Abstract:
In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers of parameters -- but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by complexity, communication costs and training instability -- we address these with the Switch Transformer. We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs. Our proposed training techniques help wrangle the instabilities and we show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats. We design models based off T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources. These improvements extend into multilingual settings where we measure gains over the mT5-Base version across all 101 languages. Finally, we advance the current scale of language models by pre-training up to trillion parameter models on the "Colossal Clean Crawled Corpus" and achieve a 4x speedup over the T5-XXL model.
Submitted 16 June, 2022; v1 submitted 11 January, 2021;
originally announced January 2021.
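The Switch Transformers abstract above says that an MoE layer selects different parameters for each input while keeping compute constant. A minimal pure-Python sketch of that idea follows, assuming top-1 routing (each token goes to the single highest-scoring expert), which the abstract does not spell out. The names `top1_route` and `moe_layer` and the toy experts are hypothetical; load balancing, capacity limits, and batching are all omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top1_route(token, router_weights):
    """Return (expert_index, gate_probability) for one token vector.

    `router_weights` holds one weight row per expert; the router logit
    for an expert is the dot product of its row with the token.
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs[expert]


def moe_layer(token, router_weights, experts):
    """Run only the selected expert and scale its output by the gate value."""
    expert, gate = top1_route(token, router_weights)
    return [gate * y for y in experts[expert](token)]


# Two toy experts: only one of them is evaluated for a given token.
experts = [lambda t: [2.0 * x for x in t], lambda t: [-x for x in t]]
router = [[1.0, 0.0], [0.0, 1.0]]
output = moe_layer([2.0, 0.5], router, experts)
```

Adding more experts grows the parameter count, but each token still pays for one router evaluation and one expert call, which is the sense in which per-example compute stays constant in this sketch.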
-
Revisiting Fundamentals of Experience Replay
Authors:
William Fedus,
Prajit Ramachandran,
Rishabh Agarwal,
Yoshua Bengio,
Hugo Larochelle,
Mark Rowland,
Will Dabney
Abstract:
Experience replay is central to off-policy algorithms in deep reinforcement learning (RL), but there remain significant gaps in our understanding. We therefore present a systematic and extensive analysis of experience replay in Q-learning methods, focusing on two fundamental properties: the replay capacity and the ratio of learning updates to experience collected (replay ratio). Our additive and ablative studies upend conventional wisdom around experience replay -- greater capacity is found to substantially increase the performance of certain algorithms, while leaving others unaffected. Counterintuitively we show that theoretically ungrounded, uncorrected n-step returns are uniquely beneficial while other techniques confer limited benefit for sifting through larger memory. Separately, by directly controlling the replay ratio we contextualize previous observations in the literature and empirically measure its importance across a variety of deep RL algorithms. Finally, we conclude by testing a set of hypotheses on the nature of these performance benefits.
Submitted 13 July, 2020;
originally announced July 2020.
-
On Catastrophic Interference in Atari 2600 Games
Authors:
William Fedus,
Dibya Ghosh,
John D. Martin,
Marc G. Bellemare,
Yoshua Bengio,
Hugo Larochelle
Abstract:
Model-free deep reinforcement learning is sample inefficient. One hypothesis -- speculated, but not confirmed -- is that catastrophic interference within an environment inhibits learning. We test this hypothesis through a large-scale empirical study in the Arcade Learning Environment (ALE) and, indeed, find supporting evidence. We show that interference causes performance to plateau; the network cannot train on segments beyond the plateau without degrading the policy used to reach there. By synthetically controlling for interference, we demonstrate performance boosts across architectures, learning algorithms and environments. A more refined analysis shows that learning one segment of a game often increases prediction errors elsewhere. Our study provides a clear empirical link between catastrophic interference and sample efficiency in reinforcement learning.
Submitted 9 June, 2020; v1 submitted 27 February, 2020;
originally announced February 2020.
-
Algorithmic Improvements for Deep Reinforcement Learning applied to Interactive Fiction
Authors:
Vishal Jain,
William Fedus,
Hugo Larochelle,
Doina Precup,
Marc G. Bellemare
Abstract:
Text-based games are a natural challenge domain for deep reinforcement learning algorithms. Their state and action spaces are combinatorially large, their reward function is sparse, and they are partially observable: the agent is informed of the consequences of its actions through textual feedback. In this paper we emphasize this latter point and consider the design of a deep reinforcement learning agent that can play from feedback alone. Our design recognizes and takes advantage of the structural characteristics of text-based games. We first propose a contextualisation mechanism, based on accumulated reward, which simplifies the learning problem and mitigates partial observability. We then study different methods that rely on the notion that most actions are ineffectual in any given situation, following Zahavy et al.'s idea of an admissible action. We evaluate these techniques in a series of text-based games of increasing difficulty based on the TextWorld framework, as well as the iconic game Zork. Empirically, we find that these techniques improve the performance of a baseline deep reinforcement learning agent applied to text-based games.
Submitted 27 November, 2019;
originally announced November 2019.
-
Benchmarking Bonus-Based Exploration Methods on the Arcade Learning Environment
Authors:
Adrien Ali Taïga,
William Fedus,
Marlos C. Machado,
Aaron Courville,
Marc G. Bellemare
Abstract:
This paper provides an empirical evaluation of recently developed exploration algorithms within the Arcade Learning Environment (ALE). We study the use of different reward bonuses that incentivize exploration in reinforcement learning. We do so by fixing the learning algorithm used and focusing only on the impact of the different exploration bonuses in the agent's performance. We use Rainbow, the state-of-the-art algorithm for value-based agents, and focus on some of the bonuses proposed in the last few years. We consider the impact these algorithms have on performance within the popular game Montezuma's Revenge which has gathered a lot of interest from the exploration community, across the set of seven games identified by Bellemare et al. (2016) as challenging for exploration, and easier games where exploration is not an issue. We find that, in our setting, recently developed bonuses do not provide significantly improved performance on Montezuma's Revenge or hard exploration games. We also find that existing bonus-based methods may negatively impact performance on games in which exploration is not an issue and may even perform worse than $ε$-greedy exploration.
Submitted 24 September, 2021; v1 submitted 6 August, 2019;
originally announced August 2019.
-
Hyperbolic Discounting and Learning over Multiple Horizons
Authors:
William Fedus,
Carles Gelada,
Yoshua Bengio,
Marc G. Bellemare,
Hugo Larochelle
Abstract:
Reinforcement learning (RL) typically defines a discount factor as part of the Markov Decision Process. The discount factor values future rewards by an exponential scheme that leads to theoretical convergence guarantees of the Bellman equation. However, evidence from psychology, economics and neuroscience suggests that humans and animals instead have hyperbolic time-preferences. In this work we revisit the fundamentals of discounting in RL and bridge this disconnect by implementing an RL agent that acts via hyperbolic discounting. We demonstrate that a simple approach approximates hyperbolic discount functions while still using familiar temporal-difference learning techniques in RL. Additionally, and independent of hyperbolic discounting, we make a surprising discovery that simultaneously learning value functions over multiple time-horizons is an effective auxiliary task which often improves over a strong value-based RL agent, Rainbow.
Submitted 28 February, 2019; v1 submitted 18 February, 2019;
originally announced February 2019.
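The hyperbolic discounting abstract above says hyperbolic discount functions can be approximated with familiar exponential-discount machinery, but does not give the construction. One way to see why this is plausible is the identity 1/(1 + k·t) = ∫₀¹ γ^(k·t) dγ, which writes the hyperbolic discount as an average of exponential discounts. The sketch below checks that identity numerically with a midpoint Riemann sum; the function names, the value of `k`, and the grid size are illustrative assumptions, not the paper's settings.

```python
def hyperbolic_discount(t, k=0.1):
    """Hyperbolic discount applied to a reward t steps in the future."""
    return 1.0 / (1.0 + k * t)


def exponential_mixture(t, k=0.1, n=1000):
    """Midpoint Riemann sum of gamma ** (k * t) over gamma in [0, 1].

    Each term is an ordinary exponential discount; averaging over a
    grid of `n` values approximates the integral that equals the
    hyperbolic discount.
    """
    return sum(((i + 0.5) / n) ** (k * t) for i in range(n)) / n


# Compare the two quantities at a few horizons.
comparison = [
    (t, hyperbolic_discount(t), exponential_mixture(t)) for t in (0, 5, 10, 20)
]
```

In an agent, each γ on the grid would correspond to a separately learned value function, which connects to the abstract's remark about learning over multiple time horizons.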
-
Language GANs Falling Short
Authors:
Massimo Caccia,
Lucas Caccia,
William Fedus,
Hugo Larochelle,
Joelle Pineau,
Laurent Charlin
Abstract:
Generating high-quality text with sufficient diversity is essential for a wide range of Natural Language Generation (NLG) tasks. Maximum-Likelihood (MLE) models trained with teacher forcing have consistently been reported as weak baselines, where poor performance is attributed to exposure bias (Bengio et al., 2015; Ranzato et al., 2015); at inference time, the model is fed its own prediction instead of a ground-truth token, which can lead to accumulating errors and poor samples. This line of reasoning has led to an outbreak of adversarial-based approaches for NLG, on the grounds that GANs do not suffer from exposure bias. In this work, we make several surprising observations which contradict common beliefs. First, we revisit the canonical evaluation framework for NLG, and point out fundamental flaws with quality-only evaluation: we show that one can outperform such metrics using a simple, well-known temperature parameter to artificially reduce the entropy of the model's conditional distributions. Second, we leverage the control over the quality / diversity trade-off given by this parameter to evaluate models over the whole quality-diversity spectrum and find MLE models consistently outperform the proposed GAN variants over the whole quality-diversity space. Our results have several implications: 1) The impact of exposure bias on sample quality is less severe than previously thought, 2) temperature tuning provides a better quality / diversity trade-off than adversarial training while being easier to train, easier to cross-validate, and less computationally expensive. Code to reproduce the experiments is available at github.com/pclucas14/GansFallingShort
Submitted 19 February, 2020; v1 submitted 6 November, 2018;
originally announced November 2018.
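
The abstract above relies on a temperature parameter that reduces the entropy of a model's conditional distributions. A minimal sketch of that mechanism follows, assuming the usual formulation in which logits are divided by the temperature before the softmax; the function names and the example logits are hypothetical and are not taken from the paper's code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by a positive temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.1]  # hypothetical next-token logits
    for temperature in (0.5, 1.0, 2.0):
        probs = softmax_with_temperature(logits, temperature)
        print(temperature, probs, entropy(probs))
```

Temperatures below 1 sharpen the distribution (lower entropy, typically higher sample quality and less diversity), while temperatures above 1 flatten it, which is the quality / diversity trade-off the abstract refers to.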
-
Deep Graph Infomax
Authors:
Petar Veličković,
William Fedus,
William L. Hamilton,
Pietro Liò,
Yoshua Bengio,
R Devon Hjelm
Abstract:
We present Deep Graph Infomax (DGI), a general approach for learning node representations within graph-structured data in an unsupervised manner. DGI relies on maximizing mutual information between patch representations and corresponding high-level summaries of graphs---both derived using established graph convolutional network architectures. The learnt patch representations summarize subgraphs centered around nodes of interest, and can thus be reused for downstream node-wise learning tasks. In contrast to most prior approaches to unsupervised learning with GCNs, DGI does not rely on random walk objectives, and is readily applicable to both transductive and inductive learning setups. We demonstrate competitive performance on a variety of node classification benchmarks, which at times even exceeds the performance of supervised learning.
Submitted 21 December, 2018; v1 submitted 27 September, 2018;
originally announced September 2018.
-
Recall Traces: Backtracking Models for Efficient Reinforcement Learning
Authors:
Anirudh Goyal,
Philemon Brakel,
William Fedus,
Soumye Singhal,
Timothy Lillicrap,
Sergey Levine,
Hugo Larochelle,
Yoshua Bengio
Abstract:
In many environments only a tiny subset of all states yields high reward. In these cases, few of the interactions with the environment provide a relevant learning signal. Hence, we may want to preferentially train on those high-reward states and the probable trajectories leading to them. To this end, we advocate for the use of a backtracking model that predicts the preceding states that terminate at a given high-reward state. We can train a model which, starting from a high value state (or one that is estimated to have high value), predicts and samples which (state, action)-tuples may have led to that high value state. These traces of (state, action) pairs, which we refer to as Recall Traces, sampled from this backtracking model starting from a high value state, are informative as they terminate in good states, and hence we can use these traces to improve a policy. We provide a variational interpretation for this idea and a practical algorithm in which the backtracking model samples from an approximate posterior distribution over trajectories which lead to large rewards. Our method improves the sample efficiency of both on- and off-policy RL algorithms across several environments and tasks.
Submitted 28 January, 2019; v1 submitted 1 April, 2018;
originally announced April 2018.
-
Disentangling the independently controllable factors of variation by interacting with the world
Authors:
Valentin Thomas,
Emmanuel Bengio,
William Fedus,
Jules Pondard,
Philippe Beaudoin,
Hugo Larochelle,
Joelle Pineau,
Doina Precup,
Yoshua Bengio
Abstract:
It has been postulated that a good representation is one that disentangles the underlying explanatory factors of variation. However, it remains an open question what kind of training framework could potentially achieve that. Whereas most previous work focuses on the static setting (e.g., with images), we postulate that some of the causal factors could be discovered if the learner is allowed to interact with its environment. The agent can experiment with different actions and observe their effects. More specifically, we hypothesize that some of these factors correspond to aspects of the environment which are independently controllable, i.e., that there exists a policy and a learnable feature for each such aspect of the environment, such that this policy can yield changes in that feature with minimal changes to other features that explain the statistical variations in the observed data. We propose a specific objective function to find such factors, and verify experimentally that it can indeed disentangle independently controllable aspects of the environment without any extrinsic reward signal.
Submitted 26 February, 2018;
originally announced February 2018.
-
MaskGAN: Better Text Generation via Filling in the______
Authors:
William Fedus,
Ian Goodfellow,
Andrew M. Dai
Abstract:
Neural text generation models are often autoregressive language models or seq2seq models. These models generate text by sampling words sequentially, with each word conditioned on the previous word, and are state-of-the-art for several machine translation and summarization benchmarks. These benchmarks are often defined by validation perplexity even though this is not a direct measure of the quality of the generated text. Additionally, these models are typically trained via maximum likelihood and teacher forcing. These methods are well-suited to optimizing perplexity but can result in poor sample quality since generating text requires conditioning on sequences of words that may have never been observed at training time. We propose to improve sample quality using Generative Adversarial Networks (GANs), which explicitly train the generator to produce high quality samples and have shown a lot of success in image generation. GANs were originally designed to output differentiable values, so discrete language generation is challenging for them. We claim that validation perplexity alone is not indicative of the quality of text generated by a model. We introduce an actor-critic conditional GAN that fills in missing text conditioned on the surrounding context. We show, qualitatively and quantitatively, evidence that this produces more realistic conditional and unconditional text samples compared to a maximum likelihood trained model.
Submitted 1 March, 2018; v1 submitted 23 January, 2018;
originally announced January 2018.
-
Many Paths to Equilibrium: GANs Do Not Need to Decrease a Divergence At Every Step
Authors:
William Fedus,
Mihaela Rosca,
Balaji Lakshminarayanan,
Andrew M. Dai,
Shakir Mohamed,
Ian Goodfellow
Abstract:
Generative adversarial networks (GANs) are a family of generative models that do not minimize a single training criterion. Unlike other generative models, the data distribution is learned via a game between a generator (the generative model) and a discriminator (a teacher providing training signal) that each minimize their own cost. GANs are designed to reach a Nash equilibrium at which each player cannot reduce their cost without changing the other players' parameters. One useful approach for the theory of GANs is to show that a divergence between the training distribution and the model distribution obtains its minimum value at equilibrium. Several recent research directions have been motivated by the idea that this divergence is the primary guide for the learning process and that every step of learning should decrease the divergence. We show that this view is overly restrictive. During GAN training, the discriminator provides learning signal in situations where the gradients of the divergences between distributions would not be useful. We provide empirical counterexamples to the view of GAN training as divergence minimization. Specifically, we demonstrate that GANs are able to learn distributions in situations where the divergence minimization point of view predicts they would fail. We also show that gradient penalties motivated from the divergence minimization perspective are equally helpful when applied in other contexts in which the divergence minimization perspective does not predict they would be helpful. This contributes to a growing body of evidence that GAN training may be more usefully viewed as approaching Nash equilibria via trajectories that do not necessarily minimize a specific divergence at each step.
Submitted 20 February, 2018; v1 submitted 23 October, 2017;
originally announced October 2017.
-
Background Rejection in the DMTPC Dark Matter Search Using Charge Signals
Authors:
J. P. Lopez,
S. Ahlen,
J. Battat,
T. Caldwell,
M. Chernicoff,
C. Deaconu,
D. Dujmic,
A. Dushkin,
W. Fedus,
P. Fisher,
F. Golub,
S. Henderson,
A. Inglis,
A. Kaboth,
G. Kohse,
L. Kirsch,
R. Lanza,
A. Lee,
J. Monroe,
H. Ouyang,
T. Sahin,
G. Sciolla,
N. Skvorodnev,
H. Tomita,
H. Wellenstein
, et al. (3 additional authors not shown)
Abstract:
The Dark Matter Time Projection Chamber (DMTPC) collaboration is developing low-pressure gas TPC detectors for measuring WIMP-nucleon interactions. Optical readout with CCD cameras allows for the detection of the daily modulation in the direction of the dark matter wind, while several charge readout channels allow for the measurement of additional recoil properties. In this article, we show that the addition of the charge readout analysis to the CCD allows us to obtain a statistics-limited 90% C.L. upper limit on the $e^-$ rejection factor of $5.6\times10^{-6}$ for recoils with energies between 40 and 200 keV$_{\mathrm{ee}}$. In addition, requiring coincidence between charge signals and light in the CCD reduces CCD-specific backgrounds by more than two orders of magnitude.
Submitted 15 September, 2011;
originally announced September 2011.
-
DMTPC: Dark matter detection with directional sensitivity
Authors:
J. B. R. Battat,
S. Ahlen,
T. Caldwell,
C. Deaconu,
D. Dujmic,
W. Fedus,
P. Fisher,
F. Golub,
S. Henderson,
A. Inglis,
A. Kaboth,
G. Kohse,
R. Lanza,
A. Lee,
J. Lopez,
J. Monroe,
T. Sahin,
G. Sciolla,
N. Skvorodnev,
H. Tomita,
H. Wellenstein,
I. Wolfe,
R. Yamamoto,
H. Yegoryan
Abstract:
The Dark Matter Time Projection Chamber (DMTPC) experiment uses CF$_4$ gas at low pressure (0.1 atm) to search for the directional signature of Galactic WIMP dark matter. We describe the DMTPC apparatus and summarize recent results from a 35.7 g-day exposure surface run at MIT. After nuclear recoil cuts are applied to the data, we find 105 candidate events in the energy range 80-200 keV, which is consistent with the expected cosmogenic neutron background. Using this data, we obtain a limit on the spin-dependent WIMP-proton cross-section of $2.0 \times 10^{-33}$ cm$^2$ at a WIMP mass of 115 GeV/$c^2$. This detector is currently deployed underground at the Waste Isolation Pilot Plant in New Mexico.
Submitted 17 December, 2010;
originally announced December 2010.
-
First Dark Matter Search Results from a Surface Run of the 10-L DMTPC Directional Dark Matter Detector
Authors:
S. Ahlen,
J. B. R. Battat,
T. Caldwell,
C. Deaconu,
D. Dujmic,
W. Fedus,
P. Fisher,
F. Golub,
S. Henderson,
A. Inglis,
A. Kaboth,
G. Kohse,
R. Lanza,
A. Lee,
J. Lopez,
J. Monroe,
T. Sahin,
G. Sciolla,
N. Skvorodnev,
H. Tomita,
H. Wellenstein,
I. Wolfe,
R. Yamamoto,
H. Yegoryan
Abstract:
The Dark Matter Time Projection Chamber (DMTPC) is a low pressure (75 Torr CF$_4$) 10 liter detector capable of measuring the vector direction of nuclear recoils with the goal of directional dark matter detection. In this paper we present the first dark matter limit from DMTPC. In an analysis window of 80-200 keV recoil energy, based on a 35.7 g-day exposure, we set a 90% C.L. upper limit on the spin-dependent WIMP-proton cross section of $2.0 \times 10^{-33}$ cm$^2$ for 115 GeV/$c^2$ dark matter particle mass.
Submitted 9 December, 2010; v1 submitted 15 June, 2010;
originally announced June 2010.
-
The case for a directional dark matter detector and the status of current experimental efforts
Authors:
S. Ahlen,
N. Afshordi,
J. B. R. Battat,
J. Billard,
N. Bozorgnia,
S. Burgos,
T. Caldwell,
J. M. Carmona,
S. Cebrian,
P. Colas,
T. Dafni,
E. Daw,
D. Dujmic,
A. Dushkin,
W. Fedus,
E. Ferrer,
D. Finkbeiner,
P. H. Fisher,
J. Forbes,
T. Fusayasu,
J. Galan,
T. Gamble,
C. Ghag,
I. Giomataris,
M. Gold
, et al. (87 additional authors not shown)
Abstract:
We present the case for a dark matter detector with directional sensitivity. This document was developed at the 2009 CYGNUS workshop on directional dark matter detection, and contains contributions from theorists and experimental groups in the field. We describe the need for a dark matter detector with directional sensitivity; each directional dark matter experiment presents their project's status; and we close with a feasibility study for scaling up to a one ton directional detector, which would cost around $150M.
Submitted 1 November, 2009;
originally announced November 2009.