
Showing 1–25 of 25 results for author: Dahl, G E

  1. arXiv:2306.07179  [pdf, other]

    cs.LG stat.ML

    Benchmarking Neural Network Training Algorithms

    Authors: George E. Dahl, Frank Schneider, Zachary Nado, Naman Agarwal, Chandramouli Shama Sastry, Philipp Hennig, Sourabh Medapati, Runa Eschenhagen, Priya Kasimbeg, Daniel Suo, Juhan Bae, Justin Gilmer, Abel L. Peirson, Bilal Khan, Rohan Anil, Mike Rabbat, Shankar Krishnan, Daniel Snider, Ehsan Amid, Kongtao Chen, Chris J. Maddison, Rakshith Vasudev, Michal Badura, Ankush Garg, Peter Mattson

    Abstract: Training algorithms, broadly construed, are an essential part of every deep learning pipeline. Training algorithm improvements that speed up training across a wide variety of workloads (e.g., better update rules, tuning protocols, learning rate schedules, or data selection schemes) could save time, save computational resources, and lead to better, more accurate, models. Unfortunately, as a communi…

    Submitted 12 June, 2023; originally announced June 2023.

    Comments: 102 pages, 8 figures, 41 tables

  2. arXiv:2207.14484  [pdf, other]

    cs.LG

    Adaptive Gradient Methods at the Edge of Stability

    Authors: Jeremy M. Cohen, Behrooz Ghorbani, Shankar Krishnan, Naman Agarwal, Sourabh Medapati, Michal Badura, Daniel Suo, David Cardoze, Zachary Nado, George E. Dahl, Justin Gilmer

    Abstract: Very little is known about the training dynamics of adaptive gradient methods like Adam in deep learning. In this paper, we shed light on the behavior of these algorithms in the full-batch and sufficiently large batch settings. Specifically, we empirically demonstrate that during full-batch training, the maximum eigenvalue of the preconditioned Hessian typically equilibrates at a certain numerical…

    Submitted 15 April, 2024; v1 submitted 29 July, 2022; originally announced July 2022.

    Comments: v2 corrects the formula for Adam's preconditioner in Eq 2

  3. arXiv:2207.03084  [pdf, other]

    cs.LG cs.AI stat.ML

    Pre-training helps Bayesian optimization too

    Authors: Zi Wang, George E. Dahl, Kevin Swersky, Chansoo Lee, Zelda Mariet, Zachary Nado, Justin Gilmer, Jasper Snoek, Zoubin Ghahramani

    Abstract: Bayesian optimization (BO) has become a popular strategy for global optimization of many expensive real-world functions. Contrary to a common belief that BO is suited to optimizing black-box functions, it actually requires domain knowledge on characteristics of those functions to deploy BO successfully. Such domain knowledge often manifests in Gaussian process priors that specify initial beliefs o…

    Submitted 7 July, 2022; originally announced July 2022.

    Comments: ICML2022 Workshop on Adaptive Experimental Design and Active Learning in the Real World. arXiv admin note: substantial text overlap with arXiv:2109.08215

  4. arXiv:2203.10139  [pdf]

    cs.LG cs.AI cs.CV eess.IV

    AI system for fetal ultrasound in low-resource settings

    Authors: Ryan G. Gomes, Bellington Vwalika, Chace Lee, Angelica Willis, Marcin Sieniek, Joan T. Price, Christina Chen, Margaret P. Kasaro, James A. Taylor, Elizabeth M. Stringer, Scott Mayer McKinney, Ntazana Sindano, George E. Dahl, William Goodnight III, Justin Gilmer, Benjamin H. Chi, Charles Lau, Terry Spitz, T Saensuksopa, Kris Liu, Jonny Wong, Rory Pilgrim, Akib Uddin, Greg Corrado, Lily Peng, et al. (4 additional authors not shown)

    Abstract: Despite considerable progress in maternal healthcare, maternal and perinatal deaths remain high in low-to-middle income countries. Fetal ultrasound is an important component of antenatal care, but shortage of adequately trained healthcare workers has limited its adoption. We developed and validated an artificial intelligence (AI) system that uses novice-acquired "blind sweep" ultrasound videos to…

    Submitted 18 March, 2022; originally announced March 2022.

  5. arXiv:2112.08250  [pdf, other]

    cs.LG

    Predicting the utility of search spaces for black-box optimization: a simple, budget-aware approach

    Authors: Setareh Ariafar, Justin Gilmer, Zachary Nado, Jasper Snoek, Rodolphe Jenatton, George E. Dahl

    Abstract: Black box optimization requires specifying a search space to explore for solutions, e.g. a d-dimensional compact space, and this choice is critical for getting the best results at a reasonable budget. Unfortunately, determining a high quality search space can be challenging in many applications. For example, when tuning hyperparameters for machine learning pipelines on a new problem given a limite…

    Submitted 16 December, 2021; v1 submitted 15 December, 2021; originally announced December 2021.

  6. arXiv:2109.08215  [pdf, other]

    cs.LG stat.ML

    Pre-trained Gaussian processes for Bayesian optimization

    Authors: Zi Wang, George E. Dahl, Kevin Swersky, Chansoo Lee, Zelda Mariet, Zachary Nado, Justin Gilmer, Jasper Snoek, Zoubin Ghahramani

    Abstract: Bayesian optimization (BO) has become a popular strategy for global optimization of many expensive real-world functions. Contrary to a common belief that BO is suited to optimizing black-box functions, it actually requires domain knowledge on characteristics of those functions to deploy BO successfully. Such domain knowledge often manifests in Gaussian process priors that specify initial beliefs o…

    Submitted 6 July, 2022; v1 submitted 16 September, 2021; originally announced September 2021.

  7. arXiv:2104.02145  [pdf, other]

    cs.CL

    What Will it Take to Fix Benchmarking in Natural Language Understanding?

    Authors: Samuel R. Bowman, George E. Dahl

    Abstract: Evaluation for many natural language understanding (NLU) tasks is broken: Unreliable and biased systems score so highly on standard benchmarks that there is little room for researchers who develop better systems to demonstrate their improvements. The recent trend to abandon IID benchmarks in favor of adversarially-constructed, out-of-distribution test sets ensures that current models will perform…

    Submitted 15 October, 2021; v1 submitted 5 April, 2021; originally announced April 2021.

    Comments: Proceedings of NAACL 2020. This revision adds a missing acknowledgment

  8. arXiv:2102.06356  [pdf, other]

    cs.LG stat.ML

    A Large Batch Optimizer Reality Check: Traditional, Generic Optimizers Suffice Across Batch Sizes

    Authors: Zachary Nado, Justin M. Gilmer, Christopher J. Shallue, Rohan Anil, George E. Dahl

    Abstract: Recently the LARS and LAMB optimizers have been proposed for training neural networks faster using large batch sizes. LARS and LAMB add layer-wise normalization to the update rules of Heavy-ball momentum and Adam, respectively, and have become popular in prominent benchmarks and deep learning libraries. However, without fair comparisons to standard optimizers, it remains an open question whether L…

    Submitted 9 June, 2021; v1 submitted 12 February, 2021; originally announced February 2021.

  9. arXiv:1910.05446  [pdf, other]

    cs.LG stat.ML

    On Empirical Comparisons of Optimizers for Deep Learning

    Authors: Dami Choi, Christopher J. Shallue, Zachary Nado, Jaehoon Lee, Chris J. Maddison, George E. Dahl

    Abstract: Selecting an optimizer is a central step in the contemporary deep learning pipeline. In this paper, we demonstrate the sensitivity of optimizer comparisons to the hyperparameter tuning protocol. Our findings suggest that the hyperparameter search space may be the single most important factor explaining the rankings obtained by recent empirical comparisons in the literature. In fact, we show that t…

    Submitted 15 June, 2020; v1 submitted 11 October, 2019; originally announced October 2019.

  10. arXiv:1907.05550  [pdf, other]

    cs.LG

    Faster Neural Network Training with Data Echoing

    Authors: Dami Choi, Alexandre Passos, Christopher J. Shallue, George E. Dahl

    Abstract: In the twilight of Moore's law, GPUs and other specialized hardware accelerators have dramatically sped up neural network training. However, earlier stages of the training pipeline, such as disk I/O and data preprocessing, do not run on accelerators. As accelerators continue to improve, these earlier stages will increasingly become the bottleneck. In this paper, we introduce "data echoing," which…

    Submitted 7 May, 2020; v1 submitted 11 July, 2019; originally announced July 2019.

  11. arXiv:1907.04164  [pdf, other]

    cs.LG stat.ML

    Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model

    Authors: Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George E. Dahl, Christopher J. Shallue, Roger Grosse

    Abstract: Increasing the batch size is a popular way to speed up neural network training, but beyond some critical batch size, larger batch sizes yield diminishing returns. In this work, we study how the critical batch size changes based on properties of the optimization algorithm, including acceleration and preconditioning, through two different lenses: large scale experiments, and analysis of a simple noi…

    Submitted 28 October, 2019; v1 submitted 9 July, 2019; originally announced July 2019.

    Comments: NeurIPS 2019

  12. arXiv:1811.03600  [pdf, other]

    cs.LG stat.ML

    Measuring the Effects of Data Parallelism on Neural Network Training

    Authors: Christopher J. Shallue, Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein, Roy Frostig, George E. Dahl

    Abstract: Recent hardware developments have dramatically increased the scale of data parallelism available for neural network training. Among the simplest ways to harness next-generation hardware is to increase the batch size in standard mini-batch neural network training algorithms. In this work, we aim to experimentally characterize the effects of increasing the batch size on training time, as measured by…

    Submitted 18 July, 2019; v1 submitted 8 November, 2018; originally announced November 2018.

    Journal ref: Journal of Machine Learning Research 20 (2019) 1-49

  13. arXiv:1808.07910  [pdf, ps, other]

    cs.LG cs.CL stat.ML

    The Importance of Generation Order in Language Modeling

    Authors: Nicolas Ford, Daniel Duckworth, Mohammad Norouzi, George E. Dahl

    Abstract: Neural language models are a critical component of state-of-the-art systems for machine translation, summarization, audio transcription, and other tasks. These language models are almost universally autoregressive in nature, generating sentences one token at a time from left to right. This paper studies the influence of token generation order on model quality via a novel two-pass language model th…

    Submitted 23 August, 2018; originally announced August 2018.

  14. arXiv:1807.06732  [pdf, other]

    cs.LG stat.ML

    Motivating the Rules of the Game for Adversarial Example Research

    Authors: Justin Gilmer, Ryan P. Adams, Ian Goodfellow, David Andersen, George E. Dahl

    Abstract: Advances in machine learning have led to broad deployment of systems with impressive performance on important problems. Nonetheless, these systems can be induced to make errors on data that are surprisingly similar to examples the learned system handles correctly. The existence of these errors raises a variety of questions about out-of-sample generalization and whether bad actors might use such ex…

    Submitted 19 July, 2018; v1 submitted 17 July, 2018; originally announced July 2018.

  15. arXiv:1806.04313  [pdf, other]

    cs.CL cs.LG

    Embedding Text in Hyperbolic Spaces

    Authors: Bhuwan Dhingra, Christopher J. Shallue, Mohammad Norouzi, Andrew M. Dai, George E. Dahl

    Abstract: Natural language text exhibits hierarchical structure in a variety of respects. Ideally, we could incorporate our prior knowledge of this hierarchical structure into unsupervised learning algorithms that work on text data. Recent work by Nickel & Kiela (2017) proposed using hyperbolic instead of Euclidean embedding spaces to represent hierarchical data and demonstrated encouraging results when emb…

    Submitted 11 June, 2018; originally announced June 2018.

    Comments: TextGraphs 2018

  16. arXiv:1805.10255  [pdf, other]

    cs.CV cs.AI cs.LG cs.NE

    Parallel Architecture and Hyperparameter Search via Successive Halving and Classification

    Authors: Manoj Kumar, George E. Dahl, Vijay Vasudevan, Mohammad Norouzi

    Abstract: We present a simple and powerful algorithm for parallel black box optimization called Successive Halving and Classification (SHAC). The algorithm operates in $K$ stages of parallel function evaluations and trains a cascade of binary classifiers to iteratively cull the undesirable regions of the search space. SHAC is easy to implement, requires no tuning of its own configuration parameters, is inva…

    Submitted 25 May, 2018; originally announced May 2018.

  17. arXiv:1804.03235  [pdf, other]

    cs.LG cs.AI stat.ML

    Large scale distributed neural network training through online distillation

    Authors: Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E. Dahl, Geoffrey E. Hinton

    Abstract: Techniques such as ensembling and distillation promise model quality improvements when paired with almost any base model. However, due to increased test-time cost (for ensembles) and increased complexity of the training pipeline (for distillation), these techniques are challenging to use in industrial settings. In this paper we explore a variant of distillation which is relatively straightforward…

    Submitted 20 August, 2020; v1 submitted 9 April, 2018; originally announced April 2018.

    Comments: Clarify that implementations should use available parallelism in pseudo-code

  18. arXiv:1704.01212  [pdf, other]

    cs.LG

    Neural Message Passing for Quantum Chemistry

    Authors: Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, George E. Dahl

    Abstract: Supervised learning on molecules has incredible potential to be useful in chemistry, drug discovery, and materials science. Luckily, several promising and closely related neural network models invariant to molecular symmetries have already been described in the literature. These models learn a message passing algorithm and aggregation procedure to compute a function of their entire input graph. At…

    Submitted 12 June, 2017; v1 submitted 4 April, 2017; originally announced April 2017.

    Comments: 14 pages

    ACM Class: I.2.6

  19. arXiv:1703.02442  [pdf, other]

    cs.CV

    Detecting Cancer Metastases on Gigapixel Pathology Images

    Authors: Yun Liu, Krishna Gadepalli, Mohammad Norouzi, George E. Dahl, Timo Kohlberger, Aleksey Boyko, Subhashini Venugopalan, Aleksei Timofeev, Philip Q. Nelson, Greg S. Corrado, Jason D. Hipp, Lily Peng, Martin C. Stumpe

    Abstract: Each year, the treatment decisions for more than 230,000 breast cancer patients in the U.S. hinge on whether the cancer has metastasized away from the breast. Metastasis detection is currently performed by pathologists reviewing large expanses of biological tissues. This process is labor intensive and error-prone. We present a framework to automatically detect and localize tumors as small as 100 x…

    Submitted 7 March, 2017; v1 submitted 3 March, 2017; originally announced March 2017.

    Comments: Fig 1: normal and tumor patches were accidentally reversed - now fixed. Minor grammatical corrections in appendix, section "Image Color Normalization"

    Journal ref: MICCAI Tutorial (2017)

  20. Machine learning prediction errors better than DFT accuracy

    Authors: Felix A. Faber, Luke Hutchison, Bing Huang, Justin Gilmer, Samuel S. Schoenholz, George E. Dahl, Oriol Vinyals, Steven Kearnes, Patrick F. Riley, O. Anatole von Lilienfeld

    Abstract: We investigate the impact of choosing regressors and molecular representations for the construction of fast machine learning (ML) models of thirteen electronic ground-state properties of organic molecules. The performance of each regressor/representation/property combination is assessed using learning curves which report out-of-sample errors as a function of training set size with up to $\sim$117k…

    Submitted 4 June, 2017; v1 submitted 17 February, 2017; originally announced February 2017.

  21. arXiv:1408.2039  [pdf]

    cs.LG stat.ML

    Incorporating Side Information in Probabilistic Matrix Factorization with Gaussian Processes

    Authors: Ryan Prescott Adams, George E. Dahl, Iain Murray

    Abstract: Probabilistic matrix factorization (PMF) is a powerful method for modeling data associated with pairwise relationships, finding use in collaborative filtering, computational biology, and document analysis, among other areas. In many domains, there are additional covariates that can assist in prediction. For example, when modeling movie ratings, we might know when the rating occurred, where the…

    Submitted 9 August, 2014; originally announced August 2014.

    Comments: Appears in Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI2010)

    Report number: UAI-P-2010-PG-1-9

  22. arXiv:1406.1231  [pdf, other]

    stat.ML cs.LG cs.NE

    Multi-task Neural Networks for QSAR Predictions

    Authors: George E. Dahl, Navdeep Jaitly, Ruslan Salakhutdinov

    Abstract: Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approache…

    Submitted 4 June, 2014; originally announced June 2014.

  23. arXiv:1309.1501  [pdf, ps, other]

    cs.LG cs.CL cs.NE math.OC stat.ML

    Improvements to deep convolutional neural networks for LVCSR

    Authors: Tara N. Sainath, Brian Kingsbury, Abdel-rahman Mohamed, George E. Dahl, George Saon, Hagen Soltau, Tomas Beran, Aleksandr Y. Aravkin, Bhuvana Ramabhadran

    Abstract: Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further imp…

    Submitted 10 December, 2013; v1 submitted 5 September, 2013; originally announced September 2013.

    Comments: 6 pages, 1 figure

    MSC Class: 65K05; 90C15; 90C90

  24. arXiv:1202.5695  [pdf, other]

    cs.LG stat.ML

    Training Restricted Boltzmann Machines on Word Observations

    Authors: George E. Dahl, Ryan P. Adams, Hugo Larochelle

    Abstract: The restricted Boltzmann machine (RBM) is a flexible tool for modeling complex data, however there have been significant computational difficulties in using RBMs to model high-dimensional multinomial observations. In natural language processing applications, words are naturally modeled by K-ary discrete distributions, where K is determined by the vocabulary size and can easily be in the hundreds o…

    Submitted 5 July, 2012; v1 submitted 25 February, 2012; originally announced February 2012.

  25. arXiv:1003.4944  [pdf, other]

    stat.ML cs.LG

    Incorporating Side Information in Probabilistic Matrix Factorization with Gaussian Processes

    Authors: Ryan Prescott Adams, George E. Dahl, Iain Murray

    Abstract: Probabilistic matrix factorization (PMF) is a powerful method for modeling data associated with pairwise relationships, finding use in collaborative filtering, computational biology, and document analysis, among other areas. In many domains, there is additional information that can assist in prediction. For example, when modeling movie ratings, we might know when the rating occurred, where the u…

    Submitted 25 March, 2010; originally announced March 2010.

    Comments: 18 pages, 4 figures, Submitted to UAI 2010