Showing 1–7 of 7 results for author: Hu, E J

  1. arXiv:2310.04363  [pdf, other]

    cs.LG cs.CL

    Amortizing intractable inference in large language models

    Authors: Edward J. Hu, Moksh Jain, Eric Elmoznino, Younesse Kaddar, Guillaume Lajoie, Yoshua Bengio, Nikolay Malkin

    Abstract: Autoregressive large language models (LLMs) compress knowledge from their training data through next-token conditional distributions. This limits tractable querying of this knowledge to start-to-end autoregressive sampling. However, many tasks of interest -- including sequence continuation, infilling, and other forms of constrained generation -- involve sampling from intractable posterior distribu…

    Submitted 13 March, 2024; v1 submitted 6 October, 2023; originally announced October 2023.

    Comments: ICLR 2024; 23 pages; code: https://github.com/GFNOrg/gfn-lm-tuning

  2. arXiv:2302.06576  [pdf, other]

    cs.LG stat.ML

    GFlowNet-EM for learning compositional latent variable models

    Authors: Edward J. Hu, Nikolay Malkin, Moksh Jain, Katie Everett, Alexandros Graikos, Yoshua Bengio

    Abstract: Latent variable models (LVMs) with discrete compositional latents are an important but challenging setting due to a combinatorially large number of possible configurations of the latents. A key tradeoff in modeling the posteriors over latents is between expressivity and tractable optimization. For algorithms based on expectation-maximization (EM), the E-step is often intractable without restrictiv…

    Submitted 3 June, 2023; v1 submitted 13 February, 2023; originally announced February 2023.

    Comments: ICML 2023; code: https://github.com/GFNOrg/GFlowNet-EM

  3. arXiv:2203.03466  [pdf, other]

    cs.LG cond-mat.dis-nn cs.NE

    Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer

    Authors: Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, Jianfeng Gao

    Abstract: Hyperparameter (HP) tuning in deep learning is an expensive process, prohibitively so for neural networks (NNs) with billions of parameters. We show that, in the recently discovered Maximal Update Parametrization (muP), many optimal HPs remain stable even as model size changes. This leads to a new HP tuning paradigm we call muTransfer: parametrize the target model in muP, tune the HP indirectly on…

    Submitted 28 March, 2022; v1 submitted 7 March, 2022; originally announced March 2022.

    Comments: NeurIPS 2021

  4. arXiv:2111.09266  [pdf, other]

    cs.LG cs.AI stat.ML

    GFlowNet Foundations

    Authors: Yoshua Bengio, Salem Lahlou, Tristan Deleu, Edward J. Hu, Mo Tiwari, Emmanuel Bengio

    Abstract: Generative Flow Networks (GFlowNets) have been introduced as a method to sample a diverse set of candidates in an active learning context, with a training objective that makes them approximately sample in proportion to a given reward function. In this paper, we show a number of additional theoretical properties of GFlowNets. They can be used to estimate joint probability distributions and the corr…

    Submitted 10 July, 2023; v1 submitted 17 November, 2021; originally announced November 2021.

  5. arXiv:2106.09685  [pdf, other]

    cs.CL cs.AI cs.LG

    LoRA: Low-Rank Adaptation of Large Language Models

    Authors: Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen

    Abstract: An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example -- deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively…

    Submitted 16 October, 2021; v1 submitted 17 June, 2021; originally announced June 2021.

    Comments: Draft V2 includes better baselines, experiments on GLUE, and more on adapter latency

  6. arXiv:2011.14522  [pdf, other]

    cs.LG cond-mat.dis-nn cs.NE

    Feature Learning in Infinite-Width Neural Networks

    Authors: Greg Yang, Edward J. Hu

    Abstract: As its width tends to infinity, a deep neural network's behavior under gradient descent can become simplified and predictable (e.g. given by the Neural Tangent Kernel (NTK)), if it is parametrized appropriately (e.g. the NTK parametrization). However, we show that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features, which is crucial…

    Submitted 15 July, 2022; v1 submitted 29 November, 2020; originally announced November 2020.

    Comments: 4th paper in the Tensor Programs series. Appearing in ICML 2021

  7. arXiv:2004.12478  [pdf, other]

    cs.LG cs.CR stat.ML

    Improved Image Wasserstein Attacks and Defenses

    Authors: Edward J. Hu, Adith Swaminathan, Hadi Salman, Greg Yang

    Abstract: Robustness against image perturbations bounded by a $\ell_p$ ball have been well-studied in recent literature. Perturbations in the real-world, however, rarely exhibit the pixel independence that $\ell_p$ threat models assume. A recently proposed Wasserstein distance-bounded threat model is a promising alternative that limits the perturbation to pixel mass movements. We point out and rectify flaws…

    Submitted 9 May, 2023; v1 submitted 26 April, 2020; originally announced April 2020.

    Comments: Best paper award at ICLR Trustworthy ML Workshop 2020