
Showing 1–50 of 123 results for author: Le, Q V

Searching in archive cs.
  1. arXiv:2406.04520  [pdf, other]

    cs.CL cs.AI

    NATURAL PLAN: Benchmarking LLMs on Natural Language Planning

    Authors: Huaixiu Steven Zheng, Swaroop Mishra, Hugh Zhang, Xinyun Chen, Minmin Chen, Azade Nova, Le Hou, Heng-Tze Cheng, Quoc V. Le, Ed H. Chi, Denny Zhou

    Abstract: We introduce NATURAL PLAN, a realistic planning benchmark in natural language containing 3 key tasks: Trip Planning, Meeting Planning, and Calendar Scheduling. We focus our evaluation on the planning capabilities of LLMs with full information on the task, by providing outputs from tools such as Google Flights, Google Maps, and Google Calendar as contexts to the models. This eliminates the need for…

    Submitted 6 June, 2024; originally announced June 2024.

  2. arXiv:2403.18802  [pdf, other]

    cs.CL cs.AI cs.LG

    Long-form factuality in large language models

    Authors: Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Jie Huang, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, Quoc V. Le

    Abstract: Large language models (LLMs) often generate content that contains factual errors when responding to fact-seeking prompts on open-ended topics. To benchmark a model's long-form factuality in open domains, we first use GPT-4 to generate LongFact, a prompt set comprising thousands of questions spanning 38 topics. We then propose that LLM agents can be used as automated evaluators for long-form factua…

    Submitted 3 April, 2024; v1 submitted 27 March, 2024; originally announced March 2024.

  3. arXiv:2402.03620  [pdf, other]

    cs.AI cs.CL

    Self-Discover: Large Language Models Self-Compose Reasoning Structures

    Authors: Pei Zhou, Jay Pujara, Xiang Ren, Xinyun Chen, Heng-Tze Cheng, Quoc V. Le, Ed H. Chi, Denny Zhou, Swaroop Mishra, Huaixiu Steven Zheng

    Abstract: We introduce SELF-DISCOVER, a general framework for LLMs to self-discover the task-intrinsic reasoning structures to tackle complex reasoning problems that are challenging for typical prompting methods. Core to the framework is a self-discovery process where LLMs select multiple atomic reasoning modules such as critical thinking and step-by-step thinking, and compose them into an explicit reasonin…

    Submitted 5 February, 2024; originally announced February 2024.

    Comments: 17 pages, 11 figures, 5 tables

  4. arXiv:2312.08472  [pdf, other]

    cs.NE cs.LG math.NA

    AutoNumerics-Zero: Automated Discovery of State-of-the-Art Mathematical Functions

    Authors: Esteban Real, Yao Chen, Mirko Rossini, Connal de Souza, Manav Garg, Akhil Verghese, Moritz Firsching, Quoc V. Le, Ekin Dogus Cubuk, David H. Park

    Abstract: Computers calculate transcendental functions by approximating them through the composition of a few limited-precision instructions. For example, an exponential can be calculated with a Taylor series. These approximation methods were developed over the centuries by mathematicians, who emphasized the attainability of arbitrary precision. Computers, however, operate on few limited precision types, su…

    Submitted 13 December, 2023; originally announced December 2023.

    ACM Class: I.2.2; I.2.6; G.1.2

  5. arXiv:2310.06117  [pdf, other]

    cs.LG cs.AI cs.CL

    Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models

    Authors: Huaixiu Steven Zheng, Swaroop Mishra, Xinyun Chen, Heng-Tze Cheng, Ed H. Chi, Quoc V Le, Denny Zhou

    Abstract: We present Step-Back Prompting, a simple prompting technique that enables LLMs to do abstractions to derive high-level concepts and first principles from instances containing specific details. Using the concepts and principles to guide reasoning, LLMs significantly improve their abilities in following a correct reasoning path towards the solution. We conduct experiments of Step-Back Prompting with…

    Submitted 12 March, 2024; v1 submitted 9 October, 2023; originally announced October 2023.

    Comments: ICLR 2024

  6. arXiv:2309.03409  [pdf, other]

    cs.LG cs.AI cs.CL

    Large Language Models as Optimizers

    Authors: Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, Xinyun Chen

    Abstract: Optimization is ubiquitous. While derivative-based algorithms have been powerful tools for various problems, the absence of gradient imposes challenges on many real-world applications. In this work, we propose Optimization by PROmpting (OPRO), a simple and effective approach to leverage large language models (LLMs) as optimizers, where the optimization task is described in natural language. In eac…

    Submitted 15 April, 2024; v1 submitted 6 September, 2023; originally announced September 2023.

    Comments: ICLR 2024; 42 pages, 26 figures, 15 tables. Code at https://github.com/google-deepmind/opro

  7. arXiv:2308.03958  [pdf, other]

    cs.CL

    Simple synthetic data reduces sycophancy in large language models

    Authors: Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, Quoc V. Le

    Abstract: Sycophancy is an undesirable behavior where models tailor their responses to follow a human user's view even when that view is not objectively correct (e.g., adapting liberal views once a user reveals that they are liberal). In this paper, we study the prevalence of sycophancy in language models and propose a simple synthetic-data intervention to reduce this behavior. First, on a set of three sy…

    Submitted 14 February, 2024; v1 submitted 7 August, 2023; originally announced August 2023.

  8. arXiv:2308.03290  [pdf, other]

    cs.CV cs.LG

    FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search

    Authors: Jordan Dotzel, Gang Wu, Andrew Li, Muhammad Umar, Yun Ni, Mohamed S. Abdelfattah, Zhiru Zhang, Liqun Cheng, Martin G. Dixon, Norman P. Jouppi, Quoc V. Le, Sheng Li

    Abstract: Quantization has become a mainstream compression technique for reducing model size, computational requirements, and energy consumption for modern deep neural networks (DNNs). With improved numerical support in recent hardware, including multiple variants of integer and floating point, mixed-precision quantization has become necessary to achieve high-quality results with low model cost. Prior mixed…

    Submitted 1 May, 2024; v1 submitted 7 August, 2023; originally announced August 2023.

    Comments: Accepted to AutoML 2024

  9. arXiv:2305.10429  [pdf, other]

    cs.CL cs.LG

    DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining

    Authors: Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy Liang, Quoc V. Le, Tengyu Ma, Adams Wei Yu

    Abstract: The mixture proportions of pretraining data domains (e.g., Wikipedia, books, web text) greatly affect language model (LM) performance. In this paper, we propose Domain Reweighting with Minimax Optimization (DoReMi), which first trains a small proxy model using group distributionally robust optimization (Group DRO) over domains to produce domain weights (mixture proportions) without knowledge of do…

    Submitted 20 November, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

    Comments: NeurIPS 2023

  10. arXiv:2305.08298  [pdf, other]

    cs.CL

    Symbol tuning improves in-context learning in language models

    Authors: Jerry Wei, Le Hou, Andrew Lampinen, Xiangning Chen, Da Huang, Yi Tay, Xinyun Chen, Yifeng Lu, Denny Zhou, Tengyu Ma, Quoc V. Le

    Abstract: We present symbol tuning - finetuning language models on in-context input-label pairs where natural language labels (e.g., "positive/negative sentiment") are replaced with arbitrary symbols (e.g., "foo/bar"). Symbol tuning leverages the intuition that when a model cannot use instructions or natural language labels to figure out a task, it must instead do so by learning the input-label mappings.…

    Submitted 30 December, 2023; v1 submitted 14 May, 2023; originally announced May 2023.

    Comments: EMNLP 2023

  11. arXiv:2302.06675  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV cs.NE

    Symbolic Discovery of Optimization Algorithms

    Authors: Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, Quoc V. Le

    Abstract: We present a method to formulate algorithm discovery as program search, and apply it to discover optimization algorithms for deep neural network training. We leverage efficient search techniques to explore an infinite and sparse program space. To bridge the large generalization gap between proxy and target tasks, we also introduce program selection and simplification strategies. Our method discove…

    Submitted 8 May, 2023; v1 submitted 13 February, 2023; originally announced February 2023.

    Comments: 30 pages, Lion is successfully deployed in production systems. We also add comparison with other automatically discovered optimizers

  12. arXiv:2302.05433  [pdf, other]

    cs.LG cs.NE

    Unified Functional Hashing in Automatic Machine Learning

    Authors: Ryan Gillard, Stephen Jonany, Yingjie Miao, Michael Munn, Connal de Souza, Jonathan Dungay, Chen Liang, David R. So, Quoc V. Le, Esteban Real

    Abstract: The field of Automatic Machine Learning (AutoML) has recently attained impressive results, including the discovery of state-of-the-art machine learning solutions, such as neural image classifiers. This is often done by applying an evolutionary search method, which samples multiple candidate solutions from a large space and evaluates the quality of each candidate through a long training process. As…

    Submitted 10 February, 2023; originally announced February 2023.

    ACM Class: I.2.2; I.2.6

  13. arXiv:2302.03917  [pdf, other]

    cs.SD cs.LG eess.AS

    Noise2Music: Text-conditioned Music Generation with Diffusion Models

    Authors: Qingqing Huang, Daniel S. Park, Tao Wang, Timo I. Denk, Andy Ly, Nanxin Chen, Zhengdong Zhang, Zhishuai Zhang, Jiahui Yu, Christian Frank, Jesse Engel, Quoc V. Le, William Chan, Zhifeng Chen, Wei Han

    Abstract: We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts. Two types of diffusion models, a generator model, which generates an intermediate representation conditioned on text, and a cascader model, which generates high-fidelity audio conditioned on the intermediate representation and possibly the text, are trained and…

    Submitted 6 March, 2023; v1 submitted 8 February, 2023; originally announced February 2023.

    Comments: 15 pages

  14. arXiv:2302.01918  [pdf, other]

    cs.LG cs.SC

    PyGlove: Efficiently Exchanging ML Ideas as Code

    Authors: Daiyi Peng, Xuanyi Dong, Esteban Real, Yifeng Lu, Quoc V. Le

    Abstract: The increasing complexity and scale of machine learning (ML) has led to the need for more efficient collaboration among multiple teams. For example, when a research team invents a new architecture like "ResNet," it is desirable for multiple engineering teams to adopt it. However, the effort required for each team to study and understand the invention does not scale well with the number of teams or…

    Submitted 3 February, 2023; originally announced February 2023.

    Comments: 8 pages, 10 figures, 1 table

  15. arXiv:2301.13688  [pdf, other]

    cs.AI cs.CL cs.LG

    The Flan Collection: Designing Data and Methods for Effective Instruction Tuning

    Authors: Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V. Le, Barret Zoph, Jason Wei, Adam Roberts

    Abstract: We study the design decisions of publicly available instruction tuning methods, and break down the development of Flan 2022 (Chung et al., 2022). Through careful ablation studies on the Flan Collection of tasks and methods, we tease apart the effect of design decisions which enable Flan-T5 to outperform prior work by 3-17%+ across evaluation settings. We find task balancing and enrichment techniqu…

    Submitted 14 February, 2023; v1 submitted 31 January, 2023; originally announced January 2023.

  16. arXiv:2211.02011  [pdf, other]

    cs.CL

    Inverse scaling can become U-shaped

    Authors: Jason Wei, Najoung Kim, Yi Tay, Quoc V. Le

    Abstract: Scaling up language models has been empirically shown to improve performance on a wide range of downstream tasks. However, if we were to observe worse performance as a function of scale ("inverse scaling") on certain tasks, this would indicate that scaling can also encourage behaviors that are misaligned with human preferences. The Inverse Scaling Prize (McKenzie et al. 2022) identified eleven suc…

    Submitted 24 May, 2023; v1 submitted 3 November, 2022; originally announced November 2022.

    Comments: v5 includes a reframed discussion section and new chain-of-thought results for Round 2 tasks

  17. arXiv:2210.11416  [pdf, other]

    cs.LG cs.CL

    Scaling Instruction-Finetuned Language Models

    Authors: Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, et al. (10 additional authors not shown)

    Abstract: Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects d…

    Submitted 6 December, 2022; v1 submitted 20 October, 2022; originally announced October 2022.

    Comments: Public checkpoints: https://huggingface.co/docs/transformers/model_doc/flan-t5

  18. arXiv:2210.11399  [pdf, other]

    cs.CL cs.AI cs.LG

    Transcending Scaling Laws with 0.1% Extra Compute

    Authors: Yi Tay, Jason Wei, Hyung Won Chung, Vinh Q. Tran, David R. So, Siamak Shakeri, Xavier Garcia, Huaixiu Steven Zheng, Jinfeng Rao, Aakanksha Chowdhery, Denny Zhou, Donald Metzler, Slav Petrov, Neil Houlsby, Quoc V. Le, Mostafa Dehghani

    Abstract: Scaling language models improves performance but comes with significant computational costs. This paper proposes UL2R, a method that substantially improves existing language models and their scaling curves with a relatively tiny amount of extra compute. The key idea is to continue training a state-of-the-art large language model (e.g., PaLM) on a few more steps with UL2's mixture-of-denoiser objec…

    Submitted 16 November, 2022; v1 submitted 20 October, 2022; originally announced October 2022.

    Comments: V2 has updated references/related work

  19. arXiv:2210.10879  [pdf, other]

    cs.LG cs.CL cs.SD eess.AS

    G-Augment: Searching for the Meta-Structure of Data Augmentation Policies for ASR

    Authors: Gary Wang, Ekin D. Cubuk, Andrew Rosenberg, Shuyang Cheng, Ron J. Weiss, Bhuvana Ramabhadran, Pedro J. Moreno, Quoc V. Le, Daniel S. Park

    Abstract: Data augmentation is a ubiquitous technique used to provide robustness to automatic speech recognition (ASR) training. However, even as so much of the ASR training process has become automated and more "end-to-end", the data augmentation policy (what augmentation functions to use, and how to apply them) remains hand-crafted. We present Graph-Augment, a technique to define the augmentation space as…

    Submitted 24 October, 2022; v1 submitted 19 October, 2022; originally announced October 2022.

    Comments: 6 pages, accepted at SLT 2022. Updated with copyright

  20. arXiv:2210.09261  [pdf, other]

    cs.CL cs.AI

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    Authors: Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, Jason Wei

    Abstract: BIG-Bench (Srivastava et al., 2022) is a diverse evaluation suite that focuses on tasks believed to be beyond the capabilities of current language models. Language models have already made good progress on this benchmark, with the best model in the BIG-Bench paper outperforming average reported human-rater results on 65% of the BIG-Bench tasks via few-shot prompting. But on what tasks do language…

    Submitted 17 October, 2022; originally announced October 2022.

    Comments: GitHub repository: https://github.com/suzgunmirac/BIG-Bench-Hard

  21. arXiv:2203.12683  [pdf, other]

    cs.CV cs.AI

    Revisiting Multi-Scale Feature Fusion for Semantic Segmentation

    Authors: Tianjian Meng, Golnaz Ghiasi, Reza Mahjourian, Quoc V. Le, Mingxing Tan

    Abstract: It is commonly believed that high internal resolution combined with expensive operations (e.g. atrous convolutions) are necessary for accurate semantic segmentation, resulting in slow speed and large memory usage. In this paper, we question this belief and demonstrate that neither high internal resolution nor atrous convolutions are necessary. Our intuition is that although segmentation is a dense…

    Submitted 14 June, 2022; v1 submitted 23 March, 2022; originally announced March 2022.

  22. arXiv:2203.08195  [pdf, other]

    cs.CV

    DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection

    Authors: Yingwei Li, Adams Wei Yu, Tianjian Meng, Ben Caine, Jiquan Ngiam, Daiyi Peng, Junyang Shen, Bo Wu, Yifeng Lu, Denny Zhou, Quoc V. Le, Alan Yuille, Mingxing Tan

    Abstract: Lidars and cameras are critical sensors that provide complementary information for 3D detection in autonomous driving. While prevalent multi-modal methods simply decorate raw lidar point clouds with camera features and feed them directly to existing 3D detection models, our study shows that fusing camera features with deep lidar features instead of raw points, can lead to better performance. Howev…

    Submitted 15 March, 2022; originally announced March 2022.

    Comments: CVPR 2022. 1st rank 3D detection method on Waymo Challenge Leaderboard: https://waymo.com/open/challenges/entry/?timestamp=1647356360224524&challenge=DETECTION_3D&emailId=5451f123-a0ea

  23. arXiv:2202.10447  [pdf, other]

    cs.LG cs.AI cs.CL cs.NE

    Transformer Quality in Linear Time

    Authors: Weizhe Hua, Zihang Dai, Hanxiao Liu, Quoc V. Le

    Abstract: We revisit the design choices in Transformers, and propose methods to address their weaknesses in handling long sequences. First, we propose a simple layer named gated attention unit, which allows the use of a weaker single-head attention with minimal quality loss. We then propose a linear approximation method complementary to this new layer, which is accelerator-friendly and highly competitive in…

    Submitted 27 June, 2022; v1 submitted 21 February, 2022; originally announced February 2022.

    Comments: Accepted to the 39th International Conference on Machine Learning (ICML'22)

  24. arXiv:2112.06905  [pdf, other]

    cs.CL

    GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

    Authors: Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu, et al. (2 additional authors not shown)

    Abstract: Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we propose and develop a family of language models named GL…

    Submitted 1 August, 2022; v1 submitted 13 December, 2021; originally announced December 2021.

    Comments: Accepted to ICML 2022

  25. arXiv:2111.10050  [pdf, other]

    cs.LG cs.CL cs.CV

    Combined Scaling for Zero-shot Transfer Learning

    Authors: Hieu Pham, Zihang Dai, Golnaz Ghiasi, Kenji Kawaguchi, Hanxiao Liu, Adams Wei Yu, Jiahui Yu, Yi-Ting Chen, Minh-Thang Luong, Yonghui Wu, Mingxing Tan, Quoc V. Le

    Abstract: We present a combined scaling method - named BASIC - that achieves 85.7% top-1 accuracy on the ImageNet ILSVRC-2012 validation set without learning from any labeled ImageNet example. This accuracy surpasses best published similar models - CLIP and ALIGN - by 9.3%. Our BASIC model also shows significant improvements in robustness benchmarks. For instance, on 5 test sets with natural distribution sh…

    Submitted 12 April, 2023; v1 submitted 19 November, 2021; originally announced November 2021.

  26. arXiv:2109.13226  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

    Authors: Yu Zhang, Daniel S. Park, Wei Han, James Qin, Anmol Gulati, Joel Shor, Aren Jansen, Yuanzhong Xu, Yanping Huang, Shibo Wang, Zongwei Zhou, Bo Li, Min Ma, William Chan, Jiahui Yu, Yongqiang Wang, Liangliang Cao, Khe Chai Sim, Bhuvana Ramabhadran, Tara N. Sainath, Françoise Beaufays, Zhifeng Chen, Quoc V. Le, Chung-Cheng Chiu, Ruoming Pang, et al. (1 additional author not shown)

    Abstract: We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled da…

    Submitted 21 July, 2022; v1 submitted 27 September, 2021; originally announced September 2021.

    Comments: 14 pages, 7 figures, 13 tables; v2: minor corrections, reference baselines and bibliography updated; v3: corrections based on reviewer feedback, bibliography updated

  27. arXiv:2109.08668  [pdf, other]

    cs.LG cs.AI cs.CL cs.NE

    Primer: Searching for Efficient Transformers for Language Modeling

    Authors: David R. So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, Quoc V. Le

    Abstract: Large Transformer models have been central to recent advances in natural language processing. The training and inference costs of these models, however, have grown rapidly and become prohibitively expensive. Here we aim to reduce the costs of Transformers by searching for a more efficient variant. Compared to previous approaches, our search is performed at a lower level, over the primitives that d…

    Submitted 24 January, 2022; v1 submitted 17 September, 2021; originally announced September 2021.

    Comments: "Primer: Searching for Efficient Transformers for Language Modeling" NeurIPS camera ready. 34 pages

  28. arXiv:2109.06270  [pdf, other]

    cs.CL

    STraTA: Self-Training with Task Augmentation for Better Few-shot Learning

    Authors: Tu Vu, Minh-Thang Luong, Quoc V. Le, Grady Simon, Mohit Iyyer

    Abstract: Despite their recent successes in tackling many NLP tasks, large-scale pre-trained language models do not perform as well in few-shot settings where only a handful of training examples are available. To address this shortcoming, we propose STraTA, which stands for Self-Training with Task Augmentation, an approach that builds on two key ideas for effective leverage of unlabeled data. First, STraTA…

    Submitted 12 April, 2022; v1 submitted 13 September, 2021; originally announced September 2021.

    Comments: Accepted as a main conference paper at EMNLP 2021, 17 pages, 3 figures, 11 tables

  29. arXiv:2109.01652  [pdf, other]

    cs.CL

    Finetuned Language Models Are Zero-Shot Learners

    Authors: Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, Quoc V. Le

    Abstract: This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning -- finetuning language models on a collection of tasks described via instructions -- substantially improves zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natur…

    Submitted 8 February, 2022; v1 submitted 3 September, 2021; originally announced September 2021.

    Comments: Version 5. Find list of changes in Appendix F (page 35)

  30. arXiv:2108.11353  [pdf, other]

    cs.CV

    Multi-Task Self-Training for Learning General Representations

    Authors: Golnaz Ghiasi, Barret Zoph, Ekin D. Cubuk, Quoc V. Le, Tsung-Yi Lin

    Abstract: Despite the fast progress in training specialized models for various tasks, learning a single general model that works well for many tasks is still challenging for computer vision. Here we introduce multi-task self-training (MuST), which harnesses the knowledge in independent specialized teacher models (e.g., ImageNet model on classification) to train a single general student model. Our approach h…

    Submitted 25 August, 2021; originally announced August 2021.

    Comments: ICCV 2021

  31. arXiv:2106.04803  [pdf, other]

    cs.CV cs.LG

    CoAtNet: Marrying Convolution and Attention for All Data Sizes

    Authors: Zihang Dai, Hanxiao Liu, Quoc V. Le, Mingxing Tan

    Abstract: Transformers have attracted increasing interests in computer vision, but they still fall behind state-of-the-art convolutional networks. In this work, we show that while Transformers tend to have larger model capacity, their generalization can be worse than convolutional networks due to the lack of the right inductive bias. To effectively combine the strengths from both architectures, we present C…

    Submitted 15 September, 2021; v1 submitted 9 June, 2021; originally announced June 2021.

  32. arXiv:2105.08050  [pdf, other]

    cs.LG cs.CL cs.CV

    Pay Attention to MLPs

    Authors: Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le

    Abstract: Transformers have become one of the most important architectural innovations in deep learning and have enabled many breakthroughs over the past few years. Here we propose a simple network architecture, gMLP, based on MLPs with gating, and show that it can perform as well as Transformers in key language and vision applications. Our comparisons show that self-attention is not critical for Vision Tra…

    Submitted 1 June, 2021; v1 submitted 17 May, 2021; originally announced May 2021.

  33. arXiv:2104.00298  [pdf, other]

    cs.CV

    EfficientNetV2: Smaller Models and Faster Training

    Authors: Mingxing Tan, Quoc V. Le

    Abstract: This paper introduces EfficientNetV2, a new family of convolutional networks that have faster training speed and better parameter efficiency than previous models. To develop this family of models, we use a combination of training-aware neural architecture search and scaling, to jointly optimize training speed and parameter efficiency. The models were searched from the search space enriched with ne…

    Submitted 23 June, 2021; v1 submitted 1 April, 2021; originally announced April 2021.

    Comments: ICML 2021

    Journal ref: International Conference on Machine Learning, 2021

  34. arXiv:2102.05918  [pdf, other]

    cs.CV cs.CL cs.LG

    Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

    Authors: Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig

    Abstract: Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datase…

    Submitted 11 June, 2021; v1 submitted 11 February, 2021; originally announced February 2021.

    Comments: ICML 2021

    Journal ref: International Conference on Machine Learning 2021

  35. arXiv:2101.08809  [pdf, other]

    cs.LG cs.PL

    PyGlove: Symbolic Programming for Automated Machine Learning

    Authors: Daiyi Peng, Xuanyi Dong, Esteban Real, Mingxing Tan, Yifeng Lu, Hanxiao Liu, Gabriel Bender, Adam Kraft, Chen Liang, Quoc V. Le

    Abstract: Neural networks are sensitive to hyper-parameter and architecture choices. Automated Machine Learning (AutoML) is a promising paradigm for automating these choices. Current ML software libraries, however, are quite limited in handling the dynamic interactions among the components of AutoML. For example, efficient NAS algorithms, such as ENAS and DARTS, typically require an implementation coupling b…

    Submitted 21 January, 2021; originally announced January 2021.

    Comments: NeurIPS 2020 Oral

  36. arXiv:2101.03958  [pdf, other]

    cs.LG cs.AI cs.NE

    Evolving Reinforcement Learning Algorithms

    Authors: John D. Co-Reyes, Yingjie Miao, Daiyi Peng, Esteban Real, Sergey Levine, Quoc V. Le, Honglak Lee, Aleksandra Faust

    Abstract: We propose a method for meta-learning reinforcement learning algorithms by searching over the space of computational graphs which compute the loss function for a value-based model-free RL agent to optimize. The learned algorithms are domain-agnostic and can generalize to new environments not seen during training. Our method can both learn from scratch and bootstrap off known existing algorithms, l…

    Submitted 10 November, 2022; v1 submitted 8 January, 2021; originally announced January 2021.

    Comments: ICLR 2021 Oral. See project website at https://sites.google.com/view/evolvingrl

  37. arXiv:2101.01761  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    AutoDropout: Learning Dropout Patterns to Regularize Deep Networks

    Authors: Hieu Pham, Quoc V. Le

    Abstract: Neural networks are often over-parameterized and hence benefit from aggressive regularization. Conventional regularization methods, such as Dropout or weight decay, do not leverage the structures of the network's inputs and hidden states. As a result, these conventional methods are less effective than methods that leverage the structures, such as SpatialDropout and DropBlock, which randomly drop t…

    Submitted 5 January, 2021; originally announced January 2021.

    Comments: Accepted to AAAI 2021

  38. arXiv:2012.08561  [pdf, other]

    cs.CL

    Pre-Training Transformers as Energy-Based Cloze Models

    Authors: Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning

    Abstract: We introduce Electric, an energy-based cloze model for representation learning over text. Like BERT, it is a conditional generative model of tokens given their contexts. However, Electric does not use masking or output a full distribution over tokens that could occur in a context. Instead, it assigns a scalar energy score to each input token indicating how likely it is given its context. We train…

    Submitted 15 December, 2020; originally announced December 2020.

    Comments: EMNLP 2020

  39. arXiv:2012.07177  [pdf, other]

    cs.CV

    Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation

    Authors: Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D. Cubuk, Quoc V. Le, Barret Zoph

    Abstract: Building instance segmentation models that are data-efficient and can handle rare object categories is an important challenge in computer vision. Leveraging data augmentations is a promising direction towards addressing this challenge. Here, we perform a systematic study of the Copy-Paste augmentation ([13, 12]) for instance segmentation where we randomly paste objects onto an image. Prior studies…

    Submitted 23 June, 2021; v1 submitted 13 December, 2020; originally announced December 2020.

    Comments: Accepted at CVPR 2021 (Oral)

  40. arXiv:2011.04419  [pdf, other]

    cs.LG cs.AI stat.ML

    Towards Domain-Agnostic Contrastive Learning

    Authors: Vikas Verma, Minh-Thang Luong, Kenji Kawaguchi, Hieu Pham, Quoc V. Le

    Abstract: Despite recent success, most contrastive self-supervised learning methods are domain-specific, relying heavily on data augmentation techniques that require knowledge about a particular domain, such as image cropping and rotation. To overcome such limitation, we propose a novel domain-agnostic approach to contrastive learning, named DACL, that is applicable to domains where invariances, and thus, d…

    Submitted 19 July, 2021; v1 submitted 9 November, 2020; originally announced November 2020.

    Comments: Published in ICML 2021

  41. arXiv:2010.10504  [pdf, other]

    eess.AS cs.LG cs.SD

    Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition

    Authors: Yu Zhang, James Qin, Daniel S. Park, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Quoc V. Le, Yonghui Wu

    Abstract: We employ a combination of recent developments in semi-supervised learning for automatic speech recognition to obtain state-of-the-art results on LibriSpeech utilizing the unlabeled audio of the Libri-Light dataset. More precisely, we carry out noisy student training with SpecAugment using giant Conformer models pre-trained using wav2vec 2.0 pre-training. By doing so, we are able to achieve word-e…

    Submitted 20 July, 2022; v1 submitted 20 October, 2020; originally announced October 2020.

    Comments: 11 pages, 3 figures, 5 tables. Accepted to NeurIPS SAS 2020 Workshop; v2: minor errors corrected

  42. arXiv:2006.14536  [pdf, other]

    cs.LG cs.CV cs.NE

    Smooth Adversarial Training

    Authors: Cihang Xie, Mingxing Tan, Boqing Gong, Alan Yuille, Quoc V. Le

    Abstract: It is commonly believed that networks cannot be both accurate and robust, that gaining robustness means losing accuracy. It is also generally believed that, unless making networks larger, network architectural elements would otherwise matter little in improving adversarial robustness. Here we present evidence to challenge these common beliefs by a careful study about adversarial training. Our key…

    Submitted 10 July, 2021; v1 submitted 25 June, 2020; originally announced June 2020.

    Comments: tech report

  43. arXiv:2006.06882  [pdf, other]

    cs.CV cs.LG stat.ML

    Rethinking Pre-training and Self-training

    Authors: Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin Cui, Hanxiao Liu, Ekin D. Cubuk, Quoc V. Le

    Abstract: Pre-training is a dominant paradigm in computer vision. For example, supervised ImageNet pre-training is commonly used to initialize the backbones of object detection and segmentation models. He et al., however, show a surprising result that ImageNet pre-training has limited impact on COCO object detection. Here we investigate self-training as another method to utilize additional data on the same…

    Submitted 15 November, 2020; v1 submitted 11 June, 2020; originally announced June 2020.

    Comments: Accepted for publication at the Thirty-fourth Conference on Neural Information Processing Systems (NeurIPS 2020)

  44. arXiv:2006.03656  [pdf, other]

    cs.CV

    AutoHAS: Efficient Hyperparameter and Architecture Search

    Authors: Xuanyi Dong, Mingxing Tan, Adams Wei Yu, Daiyi Peng, Bogdan Gabrys, Quoc V. Le

    Abstract: Efficient hyperparameter or architecture search methods have shown remarkable results, but each of them is only applicable to searching for either hyperparameters (HPs) or architectures. In this work, we propose a unified pipeline, AutoHAS, to efficiently search for both architectures and hyperparameters. AutoHAS learns to alternately update the shared network weights and a reinforcement learning…

    Submitted 7 April, 2021; v1 submitted 5 June, 2020; originally announced June 2020.

    Comments: Accepted to 2nd Workshop on Neural Architecture Search at ICLR 2021

  45. arXiv:2006.03236  [pdf, other]

    cs.LG cs.CL stat.ML

    Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing

    Authors: Zihang Dai, Guokun Lai, Yiming Yang, Quoc V. Le

    Abstract: With the success of language pretraining, it is highly desirable to develop more efficient architectures of good scalability that can exploit the abundant unlabeled data at a lower cost. To improve the efficiency, we examine the much-overlooked redundancy in maintaining a full-length token-level presentation, especially for tasks that only require a single-vector presentation of the sequence. With…

    Submitted 5 June, 2020; originally announced June 2020.

  46. Improved Noisy Student Training for Automatic Speech Recognition

    Authors: Daniel S. Park, Yu Zhang, Ye Jia, Wei Han, Chung-Cheng Chiu, Bo Li, Yonghui Wu, Quoc V. Le

    Abstract: Recently, a semi-supervised learning method known as "noisy student training" has been shown to improve image classification performance of deep networks significantly. Noisy student training is an iterative self-training method that leverages augmentation to improve network performance. In this work, we adapt and improve noisy student training for automatic speech recognition, employing (adaptive…

    Submitted 29 October, 2020; v1 submitted 19 May, 2020; originally announced May 2020.

    Comments: 5 pages, 5 figures, 4 tables; v2: minor revisions, reference added

    Journal ref: Proc. Interspeech 2020, 2817-2821

  47. arXiv:2004.10746  [pdf, other]

    cs.LG cs.AI

    Chip Placement with Deep Reinforcement Learning

    Authors: Azalia Mirhoseini, Anna Goldie, Mustafa Yazgan, Joe Jiang, Ebrahim Songhori, Shen Wang, Young-Joon Lee, Eric Johnson, Omkar Pathak, Sungmin Bae, Azade Nazi, Jiwoo Pak, Andy Tong, Kavya Srinivasa, William Hang, Emre Tuncer, Anand Babu, Quoc V. Le, James Laudon, Richard Ho, Roger Carpenter, Jeff Dean

    Abstract: In this work, we present a learning-based approach to chip placement, one of the most complex and time-consuming stages of the chip design process. Unlike prior methods, our approach has the ability to learn from past experience and improve over time. In particular, as we train over a greater number of chip blocks, our method becomes better at rapidly generating optimized placements for previously…

    Submitted 22 April, 2020; originally announced April 2020.

  48. arXiv:2004.02967  [pdf, other]

    cs.LG cs.CV cs.NE stat.ML

    Evolving Normalization-Activation Layers

    Authors: Hanxiao Liu, Andrew Brock, Karen Simonyan, Quoc V. Le

    Abstract: Normalization layers and activation functions are fundamental components in deep networks and typically co-locate with each other. Here we propose to design them using an automated approach. Instead of designing them separately, we unify them into a single tensor-to-tensor computation graph, and evolve its structure starting from basic mathematical functions. Examples of such mathematical function…

    Submitted 17 July, 2020; v1 submitted 6 April, 2020; originally announced April 2020.

  49. arXiv:2004.00831  [pdf, other]

    cs.CV

    Improving 3D Object Detection through Progressive Population Based Augmentation

    Authors: Shuyang Cheng, Zhaoqi Leng, Ekin Dogus Cubuk, Barret Zoph, Chunyan Bai, Jiquan Ngiam, Yang Song, Benjamin Caine, Vijay Vasudevan, Congcong Li, Quoc V. Le, Jonathon Shlens, Dragomir Anguelov

    Abstract: Data augmentation has been widely adopted for object detection in 3D point clouds. However, all previous related efforts have focused on manually designing specific data augmentation methods for individual architectures. In this work, we present the first attempt to automate the design of data augmentation policies for 3D object detection. We introduce the Progressive Population Based Augmentation…

    Submitted 16 July, 2020; v1 submitted 2 April, 2020; originally announced April 2020.

    Comments: Accepted at ECCV 2020

  50. arXiv:2003.10580  [pdf, other]

    cs.LG stat.ML

    Meta Pseudo Labels

    Authors: Hieu Pham, Zihang Dai, Qizhe Xie, Minh-Thang Luong, Quoc V. Le

    Abstract: We present Meta Pseudo Labels, a semi-supervised learning method that achieves a new state-of-the-art top-1 accuracy of 90.2% on ImageNet, which is 1.6% better than the existing state-of-the-art. Like Pseudo Labels, Meta Pseudo Labels has a teacher network to generate pseudo labels on unlabeled data to teach a student network. However, unlike Pseudo Labels where the teacher is fixed, the teacher i…

    Submitted 1 March, 2021; v1 submitted 23 March, 2020; originally announced March 2020.

    Comments: Preprint