
Showing 1–27 of 27 results for author: Fedus, W

  1. arXiv:2305.14705  [pdf, other]

    cs.CL

    Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models

    Authors: Sheng Shen, Le Hou, Yanqi Zhou, Nan Du, Shayne Longpre, Jason Wei, Hyung Won Chung, Barret Zoph, William Fedus, Xinyun Chen, Tu Vu, Yuexin Wu, Wuyang Chen, Albert Webson, Yunxuan Li, Vincent Zhao, Hongkun Yu, Kurt Keutzer, Trevor Darrell, Denny Zhou

    Abstract: Sparse Mixture-of-Experts (MoE) is a neural architecture design that can be utilized to add learnable parameters to Large Language Models (LLMs) without increasing inference cost. Instruction tuning is a technique for training LLMs to follow instructions. We advocate combining these two approaches, as we find that MoE models benefit more from instruction tuning than dense models. In particular, we…

    Submitted 5 July, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

    Comments: Preprint

  2. arXiv:2210.11416  [pdf, other]

    cs.LG cs.CL

    Scaling Instruction-Finetuned Language Models

    Authors: Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, et al. (10 additional authors not shown)

    Abstract: Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects d…

    Submitted 6 December, 2022; v1 submitted 20 October, 2022; originally announced October 2022.

    Comments: Public checkpoints: https://huggingface.co/docs/transformers/model_doc/flan-t5

  3. arXiv:2209.01667  [pdf, other]

    cs.LG cs.CL

    A Review of Sparse Expert Models in Deep Learning

    Authors: William Fedus, Jeff Dean, Barret Zoph

    Abstract: Sparse expert models are a thirty-year old concept re-emerging as a popular architecture in deep learning. This class of architecture encompasses Mixture-of-Experts, Switch Transformers, Routing Networks, BASE layers, and others, all with the unifying idea that each example is acted on by a subset of the parameters. By doing so, the degree of sparsity decouples the parameter count from the compute…

    Submitted 4 September, 2022; originally announced September 2022.

    Comments: 23 pages

  4. arXiv:2207.10551  [pdf, other]

    cs.LG cs.CL

    Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?

    Authors: Yi Tay, Mostafa Dehghani, Samira Abnar, Hyung Won Chung, William Fedus, Jinfeng Rao, Sharan Narang, Vinh Q. Tran, Dani Yogatama, Donald Metzler

    Abstract: There has been a lot of interest in the scaling properties of Transformer models. However, not much has been done on the front of investigating the effect of scaling properties of different inductive biases and model architectures. Do model architectures scale differently? If so, how does inductive bias affect scaling behaviour? How does this influence upstream (pretraining) and downstream (trans…

    Submitted 21 July, 2022; originally announced July 2022.

  5. arXiv:2206.07682  [pdf, other]

    cs.CL

    Emergent Abilities of Large Language Models

    Authors: Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, William Fedus

    Abstract: Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot…

    Submitted 26 October, 2022; v1 submitted 15 June, 2022; originally announced June 2022.

    Comments: Transactions on Machine Learning Research (TMLR), 2022

  6. arXiv:2206.04615  [pdf, other]

    cs.CL cs.AI cs.CY cs.LG stat.ML

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, et al. (426 additional authors not shown)

    Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur…

    Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

    Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

    Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

  7. arXiv:2202.08906  [pdf, other]

    cs.CL cs.LG

    ST-MoE: Designing Stable and Transferable Sparse Expert Models

    Authors: Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, William Fedus

    Abstract: Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine…

    Submitted 29 April, 2022; v1 submitted 17 February, 2022; originally announced February 2022.

    Comments: 25 pages main text, 39 pages overall

  8. arXiv:2109.11052  [pdf, other]

    cs.LG

    On Bonus-Based Exploration Methods in the Arcade Learning Environment

    Authors: Adrien Ali Taïga, William Fedus, Marlos C. Machado, Aaron Courville, Marc G. Bellemare

    Abstract: Research on exploration in reinforcement learning, as applied to Atari 2600 game-playing, has emphasized tackling difficult exploration problems such as Montezuma's Revenge (Bellemare et al., 2016). Recently, bonus-based exploration methods, which explore by augmenting the environment reward, have reached above-human average performance on such domains. In this paper we reassess popular bonus-base…

    Submitted 22 September, 2021; originally announced September 2021.

    Comments: Full version of arXiv:1908.02388

    Journal ref: Published as a conference paper at ICLR 2020

  9. arXiv:2109.10686  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG

    Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers

    Authors: Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, Donald Metzler

    Abstract: There remain many open questions pertaining to the scaling behaviour of Transformer architectures. These scaling decisions and findings can be critical, as training runs often come with an associated computational cost which has both financial and/or environmental impact. The goal of this paper is to present scaling insights from pretraining and finetuning Transformers. While Kaplan et al. presen…

    Submitted 30 January, 2022; v1 submitted 22 September, 2021; originally announced September 2021.

    Comments: ICLR 2022 + Updated Checkpoint Release

  10. arXiv:2103.07579  [pdf, other]

    cs.CV

    Revisiting ResNets: Improved Training and Scaling Strategies

    Authors: Irwan Bello, William Fedus, Xianzhi Du, Ekin D. Cubuk, Aravind Srinivas, Tsung-Yi Lin, Jonathon Shlens, Barret Zoph

    Abstract: Novel computer vision architectures monopolize the spotlight, but the impact of the model architecture is often conflated with simultaneous changes to training methodology and scaling strategies. Our work revisits the canonical ResNet (He et al., 2015) and studies these three aspects in an effort to disentangle them. Perhaps surprisingly, we find that training and scaling strategies may matter mor…

    Submitted 12 March, 2021; originally announced March 2021.

  11. arXiv:2102.11972  [pdf, other]

    cs.LG cs.CL

    Do Transformer Modifications Transfer Across Implementations and Applications?

    Authors: Sharan Narang, Hyung Won Chung, Yi Tay, William Fedus, Thibault Fevry, Michael Matena, Karishma Malkan, Noah Fiedel, Noam Shazeer, Zhenzhong Lan, Yanqi Zhou, Wei Li, Nan Ding, Jake Marcus, Adam Roberts, Colin Raffel

    Abstract: The research community has proposed copious modifications to the Transformer architecture since it was introduced over three years ago, relatively few of which have seen widespread adoption. In this paper, we comprehensively evaluate many of these modifications in a shared experimental setting that covers most of the common uses of the Transformer in natural language processing. Surprisingly, we f…

    Submitted 10 September, 2021; v1 submitted 23 February, 2021; originally announced February 2021.

    Comments: To appear at EMNLP 2021 as a conference paper

  12. arXiv:2101.03961  [pdf, other]

    cs.LG cs.AI

    Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    Authors: William Fedus, Barret Zoph, Noam Shazeer

    Abstract: In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers of parameters -- but a constant computational cost. However, despite several notable successes of MoE, widespread adoption has been hindered by comple…

    Submitted 16 June, 2022; v1 submitted 11 January, 2021; originally announced January 2021.

    Comments: JMLR

  13. arXiv:2007.06700  [pdf, other]

    cs.LG stat.ML

    Revisiting Fundamentals of Experience Replay

    Authors: William Fedus, Prajit Ramachandran, Rishabh Agarwal, Yoshua Bengio, Hugo Larochelle, Mark Rowland, Will Dabney

    Abstract: Experience replay is central to off-policy algorithms in deep reinforcement learning (RL), but there remain significant gaps in our understanding. We therefore present a systematic and extensive analysis of experience replay in Q-learning methods, focusing on two fundamental properties: the replay capacity and the ratio of learning updates to experience collected (replay ratio). Our additive and a…

    Submitted 13 July, 2020; originally announced July 2020.

    Comments: Published at ICML 2020. First two authors contributed equally and code available at https://github.com/google-research/google-research/tree/master/experience_replay

  14. arXiv:2002.12499  [pdf, other]

    cs.LG cs.AI stat.ML

    On Catastrophic Interference in Atari 2600 Games

    Authors: William Fedus, Dibya Ghosh, John D. Martin, Marc G. Bellemare, Yoshua Bengio, Hugo Larochelle

    Abstract: Model-free deep reinforcement learning is sample inefficient. One hypothesis -- speculated, but not confirmed -- is that catastrophic interference within an environment inhibits learning. We test this hypothesis through a large-scale empirical study in the Arcade Learning Environment (ALE) and, indeed, find supporting evidence. We show that interference causes performance to plateau; the network c…

    Submitted 9 June, 2020; v1 submitted 27 February, 2020; originally announced February 2020.

    Comments: First two authors contributed equally. Code available to reproduce experiments at https://github.com/google-research/google-research/tree/master/memento

  15. arXiv:1911.12511  [pdf, other]

    cs.AI cs.LG

    Algorithmic Improvements for Deep Reinforcement Learning applied to Interactive Fiction

    Authors: Vishal Jain, William Fedus, Hugo Larochelle, Doina Precup, Marc G. Bellemare

    Abstract: Text-based games are a natural challenge domain for deep reinforcement learning algorithms. Their state and action spaces are combinatorially large, their reward function is sparse, and they are partially observable: the agent is informed of the consequences of its actions through textual feedback. In this paper we emphasize this latter point and consider the design of a deep reinforcement learnin…

    Submitted 27 November, 2019; originally announced November 2019.

    Comments: To appear in Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20). Accepted for Oral presentation

  16. arXiv:1908.02388  [pdf, other]

    cs.LG stat.ML

    Benchmarking Bonus-Based Exploration Methods on the Arcade Learning Environment

    Authors: Adrien Ali Taïga, William Fedus, Marlos C. Machado, Aaron Courville, Marc G. Bellemare

    Abstract: This paper provides an empirical evaluation of recently developed exploration algorithms within the Arcade Learning Environment (ALE). We study the use of different reward bonuses that incentivize exploration in reinforcement learning. We do so by fixing the learning algorithm used and focusing only on the impact of the different exploration bonuses in the agent's performance. We use Rainbow, the s…

    Submitted 24 September, 2021; v1 submitted 6 August, 2019; originally announced August 2019.

    Comments: Accepted at the second Exploration in Reinforcement Learning Workshop at the 36th International Conference on Machine Learning, Long Beach, California. The full version arXiv:2109.11052 was published as a conference paper at ICLR 2020

  17. arXiv:1902.06865  [pdf, other]

    stat.ML cs.LG

    Hyperbolic Discounting and Learning over Multiple Horizons

    Authors: William Fedus, Carles Gelada, Yoshua Bengio, Marc G. Bellemare, Hugo Larochelle

    Abstract: Reinforcement learning (RL) typically defines a discount factor as part of the Markov Decision Process. The discount factor values future rewards by an exponential scheme that leads to theoretical convergence guarantees of the Bellman equation. However, evidence from psychology, economics and neuroscience suggests that humans and animals instead have hyperbolic time-preferences. In this work we re…

    Submitted 28 February, 2019; v1 submitted 18 February, 2019; originally announced February 2019.

  18. arXiv:1811.02549  [pdf, other]

    cs.CL cs.LG

    Language GANs Falling Short

    Authors: Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, Laurent Charlin

    Abstract: Generating high-quality text with sufficient diversity is essential for a wide range of Natural Language Generation (NLG) tasks. Maximum-Likelihood (MLE) models trained with teacher forcing have consistently been reported as weak baselines, where poor performance is attributed to exposure bias (Bengio et al., 2015; Ranzato et al., 2015); at inference time, the model is fed its own prediction inste…

    Submitted 19 February, 2020; v1 submitted 6 November, 2018; originally announced November 2018.

    Journal ref: ICLR 2020 - Proceedings of the Seventh International Conference on Learning Representation

  19. arXiv:1809.10341  [pdf, other]

    stat.ML cs.IT cs.LG cs.SI

    Deep Graph Infomax

    Authors: Petar Veličković, William Fedus, William L. Hamilton, Pietro Liò, Yoshua Bengio, R Devon Hjelm

    Abstract: We present Deep Graph Infomax (DGI), a general approach for learning node representations within graph-structured data in an unsupervised manner. DGI relies on maximizing mutual information between patch representations and corresponding high-level summaries of graphs---both derived using established graph convolutional network architectures. The learnt patch representations summarize subgraphs ce…

    Submitted 21 December, 2018; v1 submitted 27 September, 2018; originally announced September 2018.

    Comments: To appear at ICLR 2019. 17 pages, 8 figures

  20. arXiv:1804.00379  [pdf, other]

    cs.LG stat.ML

    Recall Traces: Backtracking Models for Efficient Reinforcement Learning

    Authors: Anirudh Goyal, Philemon Brakel, William Fedus, Soumye Singhal, Timothy Lillicrap, Sergey Levine, Hugo Larochelle, Yoshua Bengio

    Abstract: In many environments only a tiny subset of all states yield high reward. In these cases, few of the interactions with the environment provide a relevant learning signal. Hence, we may want to preferentially train on those high-reward states and the probable trajectories leading to them. To this end, we advocate for the use of a backtracking model that predicts the preceding states that terminate a…

    Submitted 28 January, 2019; v1 submitted 1 April, 2018; originally announced April 2018.

    Comments: Accepted at ICLR 2019

  21. arXiv:1802.09484  [pdf, other]

    stat.ML cs.LG

    Disentangling the independently controllable factors of variation by interacting with the world

    Authors: Valentin Thomas, Emmanuel Bengio, William Fedus, Jules Pondard, Philippe Beaudoin, Hugo Larochelle, Joelle Pineau, Doina Precup, Yoshua Bengio

    Abstract: It has been postulated that a good representation is one that disentangles the underlying explanatory factors of variation. However, it remains an open question what kind of training framework could potentially achieve that. Whereas most previous work focuses on the static setting (e.g., with images), we postulate that some of the causal factors could be discovered if the learner is allowed to int…

    Submitted 26 February, 2018; originally announced February 2018.

    Comments: Presented at NIPS 2017 Learning Disentangling Representations Workshop

  22. arXiv:1801.07736  [pdf, other]

    stat.ML cs.AI cs.LG

    MaskGAN: Better Text Generation via Filling in the______

    Authors: William Fedus, Ian Goodfellow, Andrew M. Dai

    Abstract: Neural text generation models are often autoregressive language models or seq2seq models. These models generate text by sampling words sequentially, with each word conditioned on the previous word, and are state-of-the-art for several machine translation and summarization benchmarks. These benchmarks are often defined by validation perplexity even though this is not a direct measure of the quality…

    Submitted 1 March, 2018; v1 submitted 23 January, 2018; originally announced January 2018.

    Comments: 16 pages, ICLR 2018

  23. arXiv:1710.08446  [pdf, other]

    stat.ML cs.LG

    Many Paths to Equilibrium: GANs Do Not Need to Decrease a Divergence At Every Step

    Authors: William Fedus, Mihaela Rosca, Balaji Lakshminarayanan, Andrew M. Dai, Shakir Mohamed, Ian Goodfellow

    Abstract: Generative adversarial networks (GANs) are a family of generative models that do not minimize a single training criterion. Unlike other generative models, the data distribution is learned via a game between a generator (the generative model) and a discriminator (a teacher providing training signal) that each minimize their own cost. GANs are designed to reach a Nash equilibrium at which each playe…

    Submitted 20 February, 2018; v1 submitted 23 October, 2017; originally announced October 2017.

    Comments: 18 pages

  24. arXiv:1109.3501  [pdf, ps, other]

    astro-ph.IM hep-ex physics.ins-det

    Background Rejection in the DMTPC Dark Matter Search Using Charge Signals

    Authors: J. P. Lopez, S. Ahlen, J. Battat, T. Caldwell, M. Chernicoff, C. Deaconu, D. Dujmic, A. Dushkin, W. Fedus, P. Fisher, F. Golub, S. Henderson, A. Inglis, A. Kaboth, G. Kohse, L. Kirsch, R. Lanza, A. Lee, J. Monroe, H. Ouyang, T. Sahin, G. Sciolla, N. Skvorodnev, H. Tomita, H. Wellenstein, et al. (3 additional authors not shown)

    Abstract: The Dark Matter Time Projection Chamber (DMTPC) collaboration is developing low-pressure gas TPC detectors for measuring WIMP-nucleon interactions. Optical readout with CCD cameras allows for the detection of the daily modulation in the direction of the dark matter wind, while several charge readout channels allow for the measurement of additional recoil properties. In this article, we show that…

    Submitted 15 September, 2011; originally announced September 2011.

    Comments: 8 pages, 6 figures. For proceedings of DPF 2011 conference

  25. arXiv:1012.3912  [pdf, other]

    astro-ph.IM astro-ph.CO

    DMTPC: Dark matter detection with directional sensitivity

    Authors: J. B. R. Battat, S. Ahlen, T. Caldwell, C. Deaconu, D. Dujmic, W. Fedus, P. Fisher, F. Golub, S. Henderson, A. Inglis, A. Kaboth, G. Kohse, R. Lanza, A. Lee, J. Lopez, J. Monroe, T. Sahin, G. Sciolla, N. Skvorodnev, H. Tomita, H. Wellenstein, I. Wolfe, R. Yamamoto, H. Yegoryan

    Abstract: The Dark Matter Time Projection Chamber (DMTPC) experiment uses CF_4 gas at low pressure (0.1 atm) to search for the directional signature of Galactic WIMP dark matter. We describe the DMTPC apparatus and summarize recent results from a 35.7 g-day exposure surface run at MIT. After nuclear recoil cuts are applied to the data, we find 105 candidate events in the energy range 80 - 200 keV, which is…

    Submitted 17 December, 2010; originally announced December 2010.

    Comments: Conference proceedings from the Identification of Dark Matter 2010, Montpellier, France. To be published by SISSA as PoS(IDM2010)042. 7 pages, 6 figures

  26. First Dark Matter Search Results from a Surface Run of the 10-L DMTPC Directional Dark Matter Detector

    Authors: S. Ahlen, J. B. R. Battat, T. Caldwell, C. Deaconu, D. Dujmic, W. Fedus, P. Fisher, F. Golub, S. Henderson, A. Inglis, A. Kaboth, G. Kohse, R. Lanza, A. Lee, J. Lopez, J. Monroe, T. Sahin, G. Sciolla, N. Skvorodnev, H. Tomita, H. Wellenstein, I. Wolfe, R. Yamamoto, H. Yegoryan

    Abstract: The Dark Matter Time Projection Chamber (DMTPC) is a low pressure (75 Torr CF4) 10 liter detector capable of measuring the vector direction of nuclear recoils with the goal of directional dark matter detection. In this paper we present the first dark matter limit from DMTPC. In an analysis window of 80-200 keV recoil energy, based on a 35.7 g-day exposure, we set a 90% C.L. upper limit on the spin…

    Submitted 9 December, 2010; v1 submitted 15 June, 2010; originally announced June 2010.

    Comments: accepted for publication in Physics Letters B

    Journal ref: Phys.Lett.B695:124-129,2011

  27. The case for a directional dark matter detector and the status of current experimental efforts

    Authors: S. Ahlen, N. Afshordi, J. B. R. Battat, J. Billard, N. Bozorgnia, S. Burgos, T. Caldwell, J. M. Carmona, S. Cebrian, P. Colas, T. Dafni, E. Daw, D. Dujmic, A. Dushkin, W. Fedus, E. Ferrer, D. Finkbeiner, P. H. Fisher, J. Forbes, T. Fusayasu, J. Galan, T. Gamble, C. Ghag, I. Giomataris, M. Gold, et al. (87 additional authors not shown)

    Abstract: We present the case for a dark matter detector with directional sensitivity. This document was developed at the 2009 CYGNUS workshop on directional dark matter detection, and contains contributions from theorists and experimental groups in the field. We describe the need for a dark matter detector with directional sensitivity; each directional dark matter experiment presents their project's stat…

    Submitted 1 November, 2009; originally announced November 2009.

    Comments: 48 pages, 37 figures, whitepaper on direct dark matter detection with directional sensitivity

    Journal ref: Int.J.Mod.Phys.A25:1-51,2010