-
Combinatorial Fiedler Theory and Graph Partition
Authors:
Enide Andrade,
Geir Dahl
Abstract:
Partition problems in graphs are extremely important in applications, as shown in the data science and machine learning literature. One approach is spectral partitioning based on a Fiedler vector, i.e., an eigenvector corresponding to the second smallest eigenvalue $a(G)$ of the Laplacian matrix $L_G$ of the graph $G$. This problem corresponds to the minimization of a quadratic form associated with $L_G$, under certain constraints involving the $\ell_2$-norm. We introduce and investigate a similar problem, but using the $\ell_1$-norm to measure distances. This leads to a new parameter $b(G)$ as the optimal value. We show that a well-known cut problem arises in this approach, namely the sparsest cut problem. We prove connectivity results and different bounds on this new parameter, relate it to Fiedler theory, and show explicit expressions for $b(G)$ for trees. We also comment on an $\ell_{\infty}$-norm version of the problem.
Submitted 22 June, 2023;
originally announced June 2023.
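The classical $\ell_2$ approach mentioned in the abstract above (partitioning by the signs of a Fiedler vector) can be sketched as follows. This is a minimal illustration, not the authors' method: the function names `laplacian` and `fiedler` are hypothetical, and a shifted power iteration is swapped in for a full eigensolver so that the sketch needs no external libraries. It assumes a connected graph and a start vector that is not orthogonal to the Fiedler vector.

```python
def laplacian(n, edges):
    """Laplacian L = D - A of an undirected simple graph on vertices 0..n-1."""
    L = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        L[u][u] += 1.0
        L[v][v] += 1.0
        L[u][v] -= 1.0
        L[v][u] -= 1.0
    return L


def fiedler(n, edges, iters=2000):
    """Approximate (a(G), Fiedler vector) for a connected graph.

    Runs power iteration on c*I - L restricted to vectors orthogonal to the
    all-ones vector; c = 2 * (max degree) bounds the eigenvalues of L.
    """
    L = laplacian(n, edges)
    c = 2.0 * max(L[i][i] for i in range(n))
    x = [float(i) for i in range(n)]  # deterministic, non-constant start (assumption)
    for _ in range(iters):
        x = [c * x[i] - sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]
        mean = sum(x) / n
        x = [xi - mean for xi in x]             # remove the all-ones component
        norm = sum(xi * xi for xi in x) ** 0.5
        x = [xi / norm for xi in x]             # keep unit l2-norm
    # Rayleigh quotient x^T L x of the unit vector x approximates a(G).
    a = sum(x[i] * L[i][j] * x[j] for i in range(n) for j in range(n))
    return a, x


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
a, x = fiedler(6, edges)
part = ([i for i in range(6) if x[i] < 0], [i for i in range(6) if x[i] >= 0])
```

For this example graph the sign pattern of the computed vector separates the two triangles, cutting only the connecting edge.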
-
Benchmarking Neural Network Training Algorithms
Authors:
George E. Dahl,
Frank Schneider,
Zachary Nado,
Naman Agarwal,
Chandramouli Shama Sastry,
Philipp Hennig,
Sourabh Medapati,
Runa Eschenhagen,
Priya Kasimbeg,
Daniel Suo,
Juhan Bae,
Justin Gilmer,
Abel L. Peirson,
Bilal Khan,
Rohan Anil,
Mike Rabbat,
Shankar Krishnan,
Daniel Snider,
Ehsan Amid,
Kongtao Chen,
Chris J. Maddison,
Rakshith Vasudev,
Michal Badura,
Ankush Garg,
Peter Mattson
Abstract:
Training algorithms, broadly construed, are an essential part of every deep learning pipeline. Training algorithm improvements that speed up training across a wide variety of workloads (e.g., better update rules, tuning protocols, learning rate schedules, or data selection schemes) could save time, save computational resources, and lead to better, more accurate, models. Unfortunately, as a community, we are currently unable to reliably identify training algorithm improvements, or even determine the state-of-the-art training algorithm. In this work, using concrete experiments, we argue that real progress in speeding up training requires new benchmarks that resolve three basic challenges faced by empirical comparisons of training algorithms: (1) how to decide when training is complete and precisely measure training time, (2) how to handle the sensitivity of measurements to exact workload details, and (3) how to fairly compare algorithms that require hyperparameter tuning. In order to address these challenges, we introduce a new, competitive, time-to-result benchmark using multiple workloads running on fixed hardware, the AlgoPerf: Training Algorithms benchmark. Our benchmark includes a set of workload variants that make it possible to detect benchmark submissions that are more robust to workload changes than current widely-used methods. Finally, we evaluate baseline submissions constructed using various optimizers that represent current practice, as well as other optimizers that have recently received attention in the literature. These baseline results collectively demonstrate the feasibility of our benchmark, show that non-trivial gaps between methods exist, and set a provisional state-of-the-art for future benchmark submissions to try and surpass.
Submitted 12 June, 2023;
originally announced June 2023.
-
Adaptive Gradient Methods at the Edge of Stability
Authors:
Jeremy M. Cohen,
Behrooz Ghorbani,
Shankar Krishnan,
Naman Agarwal,
Sourabh Medapati,
Michal Badura,
Daniel Suo,
David Cardoze,
Zachary Nado,
George E. Dahl,
Justin Gilmer
Abstract:
Very little is known about the training dynamics of adaptive gradient methods like Adam in deep learning. In this paper, we shed light on the behavior of these algorithms in the full-batch and sufficiently large batch settings. Specifically, we empirically demonstrate that during full-batch training, the maximum eigenvalue of the preconditioned Hessian typically equilibrates at a certain numerical value -- the stability threshold of a gradient descent algorithm. For Adam with step size $η$ and $β_1 = 0.9$, this stability threshold is $38/η$. Similar effects occur during minibatch training, especially as the batch size grows. Yet, even though adaptive methods train at the "Adaptive Edge of Stability" (AEoS), their behavior in this regime differs in a significant way from that of non-adaptive methods at the EoS. Whereas non-adaptive algorithms at the EoS are blocked from entering high-curvature regions of the loss landscape, adaptive gradient methods at the AEoS can keep advancing into high-curvature regions, while adapting the preconditioner to compensate. Our findings can serve as a foundation for the community's future understanding of adaptive gradient methods in deep learning.
Submitted 15 April, 2024; v1 submitted 29 July, 2022;
originally announced July 2022.
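The value $38/η$ quoted in the abstract above is consistent with the standard stability analysis of gradient descent with exponential-moving-average momentum on a quadratic. The abstract does not spell out a derivation, so the following is a sketch under assumptions: the preconditioner is treated as fixed, and $\lambda$ denotes the curvature (here, the largest eigenvalue of the preconditioned Hessian) along one direction.

```latex
% EMA-style momentum on a one-dimensional quadratic with curvature \lambda
m_{t+1} = \beta_1 m_t + (1-\beta_1)\,\lambda x_t, \qquad
x_{t+1} = x_t - \eta\, m_{t+1}
% Eliminating m_t gives a second-order linear recurrence:
x_{t+1} = \bigl(1 + \beta_1 - \eta(1-\beta_1)\lambda\bigr)\, x_t - \beta_1 x_{t-1}
% Both roots of z^2 - (1 + \beta_1 - \eta(1-\beta_1)\lambda)\, z + \beta_1 = 0
% lie inside the unit disc exactly when
0 < \eta(1-\beta_1)\lambda < 2(1+\beta_1)
\quad\Longleftrightarrow\quad
\lambda < \frac{2 + 2\beta_1}{(1-\beta_1)\,\eta}
% With \beta_1 = 0.9:
\frac{2 + 2(0.9)}{(1 - 0.9)\,\eta} = \frac{3.8}{0.1\,\eta} = \frac{38}{\eta}
```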
-
Pre-training helps Bayesian optimization too
Authors:
Zi Wang,
George E. Dahl,
Kevin Swersky,
Chansoo Lee,
Zelda Mariet,
Zachary Nado,
Justin Gilmer,
Jasper Snoek,
Zoubin Ghahramani
Abstract:
Bayesian optimization (BO) has become a popular strategy for global optimization of many expensive real-world functions. Contrary to a common belief that BO is suited to optimizing black-box functions, it actually requires domain knowledge on characteristics of those functions to deploy BO successfully. Such domain knowledge often manifests in Gaussian process priors that specify initial beliefs on functions. However, even with expert knowledge, it is not an easy task to select a prior. This is especially true for hyperparameter tuning problems on complex machine learning models, where landscapes of tuning objectives are often difficult to comprehend. We seek an alternative practice for setting these functional priors. In particular, we consider the scenario where we have data from similar functions that allow us to pre-train a tighter distribution a priori. To verify our approach in realistic model training setups, we collected a large multi-task hyperparameter tuning dataset by training tens of thousands of configurations of near-state-of-the-art models on popular image and text datasets, as well as a protein sequence dataset. Our results show that on average, our method is able to locate good hyperparameters at least 3 times more efficiently than the best competing methods.
Submitted 7 July, 2022;
originally announced July 2022.
-
AI system for fetal ultrasound in low-resource settings
Authors:
Ryan G. Gomes,
Bellington Vwalika,
Chace Lee,
Angelica Willis,
Marcin Sieniek,
Joan T. Price,
Christina Chen,
Margaret P. Kasaro,
James A. Taylor,
Elizabeth M. Stringer,
Scott Mayer McKinney,
Ntazana Sindano,
George E. Dahl,
William Goodnight III,
Justin Gilmer,
Benjamin H. Chi,
Charles Lau,
Terry Spitz,
T Saensuksopa,
Kris Liu,
Jonny Wong,
Rory Pilgrim,
Akib Uddin,
Greg Corrado,
Lily Peng
, et al. (4 additional authors not shown)
Abstract:
Despite considerable progress in maternal healthcare, maternal and perinatal deaths remain high in low-to-middle income countries. Fetal ultrasound is an important component of antenatal care, but shortage of adequately trained healthcare workers has limited its adoption. We developed and validated an artificial intelligence (AI) system that uses novice-acquired "blind sweep" ultrasound videos to estimate gestational age (GA) and fetal malpresentation. We further addressed obstacles that may be encountered in low-resourced settings. Using a simplified sweep protocol with real-time AI feedback on sweep quality, we have demonstrated the generalization of model performance to minimally trained novice ultrasound operators using low cost ultrasound devices with on-device AI integration. The GA model was non-inferior to standard fetal biometry estimates with as few as two sweeps, and the fetal malpresentation model had high AUC-ROCs across operators and devices. Our AI models have the potential to assist in upleveling the capabilities of lightly trained ultrasound operators in low resource settings.
Submitted 18 March, 2022;
originally announced March 2022.
-
Predicting the utility of search spaces for black-box optimization: a simple, budget-aware approach
Authors:
Setareh Ariafar,
Justin Gilmer,
Zachary Nado,
Jasper Snoek,
Rodolphe Jenatton,
George E. Dahl
Abstract:
Black box optimization requires specifying a search space to explore for solutions, e.g. a d-dimensional compact space, and this choice is critical for getting the best results at a reasonable budget. Unfortunately, determining a high quality search space can be challenging in many applications. For example, when tuning hyperparameters for machine learning pipelines on a new problem given a limited budget, one must strike a balance between excluding potentially promising regions and keeping the search space small enough to be tractable. The goal of this work is to motivate -- through example applications in tuning deep neural networks -- the problem of predicting the quality of search spaces conditioned on budgets, as well as to provide a simple scoring method based on a utility function applied to a probabilistic response surface model, similar to Bayesian optimization. We show that the method we present can compute meaningful budget-conditional scores in a variety of situations. We also provide experimental evidence that accurate scores can be useful in constructing and pruning search spaces. Ultimately, we believe scoring search spaces should become standard practice in the experimental workflow for deep learning.
Submitted 16 December, 2021; v1 submitted 15 December, 2021;
originally announced December 2021.
-
A Loss Curvature Perspective on Training Instability in Deep Learning
Authors:
Justin Gilmer,
Behrooz Ghorbani,
Ankush Garg,
Sneha Kudugunta,
Behnam Neyshabur,
David Cardoze,
George Dahl,
Zachary Nado,
Orhan Firat
Abstract:
In this work, we study the evolution of the loss Hessian across many classification tasks in order to understand the effect the curvature of the loss has on the training dynamics. Whereas prior work has focused on how different learning rates affect the loss Hessian observed during training, we also analyze the effects of model initialization, architectural choices, and common training heuristics such as gradient clipping and learning rate warmup. Our results demonstrate that successful model and hyperparameter choices allow the early optimization trajectory to either avoid -- or navigate out of -- regions of high curvature and into flatter regions that tolerate a higher learning rate. Our results suggest a unifying perspective on how disparate mitigation strategies for training instability ultimately address the same underlying failure mode of neural network optimization, namely poor conditioning. Inspired by the conditioning perspective, we show that learning rate warmup can improve training stability just as much as batch normalization, layer normalization, MetaInit, GradInit, and Fixup initialization.
Submitted 8 October, 2021;
originally announced October 2021.
-
Pre-trained Gaussian processes for Bayesian optimization
Authors:
Zi Wang,
George E. Dahl,
Kevin Swersky,
Chansoo Lee,
Zelda Mariet,
Zachary Nado,
Justin Gilmer,
Jasper Snoek,
Zoubin Ghahramani
Abstract:
Bayesian optimization (BO) has become a popular strategy for global optimization of many expensive real-world functions. Contrary to a common belief that BO is suited to optimizing black-box functions, it actually requires domain knowledge on characteristics of those functions to deploy BO successfully. Such domain knowledge often manifests in Gaussian process priors that specify initial beliefs on functions. However, even with expert knowledge, it is not an easy task to select a prior. This is especially true for hyperparameter tuning problems on complex machine learning models, where landscapes of tuning objectives are often difficult to comprehend. We seek an alternative practice for setting these functional priors. In particular, we consider the scenario where we have data from similar functions that allow us to pre-train a tighter distribution a priori. Theoretically, we show a bounded regret of BO with pre-trained priors. To verify our approach in realistic model training setups, we collected a large multi-task hyperparameter tuning dataset by training tens of thousands of configurations of near-state-of-the-art models on popular image and text datasets, as well as a protein sequence dataset. Our results show that on average, our method is able to locate good hyperparameters at least 3 times more efficiently than the best competing methods.
Submitted 6 July, 2022; v1 submitted 16 September, 2021;
originally announced September 2021.
-
What Will it Take to Fix Benchmarking in Natural Language Understanding?
Authors:
Samuel R. Bowman,
George E. Dahl
Abstract:
Evaluation for many natural language understanding (NLU) tasks is broken: Unreliable and biased systems score so highly on standard benchmarks that there is little room for researchers who develop better systems to demonstrate their improvements. The recent trend to abandon IID benchmarks in favor of adversarially-constructed, out-of-distribution test sets ensures that current models will perform poorly, but ultimately only obscures the abilities that we want our benchmarks to measure. In this position paper, we lay out four criteria that we argue NLU benchmarks should meet. We argue most current benchmarks fail at these criteria, and that adversarial data collection does not meaningfully address the causes of these failures. Instead, restoring a healthy evaluation ecosystem will require significant progress in the design of benchmark datasets, the reliability with which they are annotated, their size, and the ways they handle social bias.
Submitted 15 October, 2021; v1 submitted 5 April, 2021;
originally announced April 2021.
-
A Large Batch Optimizer Reality Check: Traditional, Generic Optimizers Suffice Across Batch Sizes
Authors:
Zachary Nado,
Justin M. Gilmer,
Christopher J. Shallue,
Rohan Anil,
George E. Dahl
Abstract:
Recently the LARS and LAMB optimizers have been proposed for training neural networks faster using large batch sizes. LARS and LAMB add layer-wise normalization to the update rules of Heavy-ball momentum and Adam, respectively, and have become popular in prominent benchmarks and deep learning libraries. However, without fair comparisons to standard optimizers, it remains an open question whether LARS and LAMB have any benefit over traditional, generic algorithms. In this work we demonstrate that standard optimization algorithms such as Nesterov momentum and Adam can match or exceed the results of LARS and LAMB at large batch sizes. Our results establish new, stronger baselines for future comparisons at these batch sizes and shed light on the difficulties of comparing optimizers for neural network training more generally.
Submitted 9 June, 2021; v1 submitted 12 February, 2021;
originally announced February 2021.
-
Sign-restricted matrices of $0$'s, $1$'s, and $-1$'s
Authors:
Richard A. Brualdi,
Geir Dahl
Abstract:
We study sign-restricted matrices (SRMs), a class of rectangular $(0, \pm 1)$-matrices generalizing the alternating sign matrices (ASMs). In an SRM each partial column sum, starting from row 1, equals 0 or 1, and each partial row sum, starting from column 1, is nonnegative. We determine the maximum number of nonzeros in SRMs and characterize the possible row and column sum vectors. Moreover, a number of results on interchange operations are shown, both for SRMs and, more generally, for $(0, \pm 1)$-matrices. The Bruhat order on ASMs can be extended to SRMs, with the result being a distributive lattice. Also, we study polytopes associated with SRMs and some related decompositions.
Submitted 11 January, 2021;
originally announced January 2021.
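The two defining conditions of an SRM stated in the abstract above translate directly into a membership test. The sketch below is only an illustration of that definition; the function name `is_srm` and the list-of-rows representation are assumptions, not taken from the paper.

```python
def is_srm(rows):
    """Check the SRM conditions for a rectangular matrix given as a list of rows."""
    if not rows or any(len(r) != len(rows[0]) for r in rows):
        return False  # must be a non-empty rectangular array
    if any(entry not in (-1, 0, 1) for r in rows for entry in r):
        return False  # entries are restricted to 0, 1 and -1
    # Each partial row sum, starting from column 1, must be nonnegative.
    for r in rows:
        total = 0
        for entry in r:
            total += entry
            if total < 0:
                return False
    # Each partial column sum, starting from row 1, must equal 0 or 1.
    for col in zip(*rows):
        total = 0
        for entry in col:
            total += entry
            if total not in (0, 1):
                return False
    return True


asm = [[0, 1, 0], [1, -1, 1], [0, 1, 0]]   # an alternating sign matrix
wide = [[1, 1]]                            # an SRM that is not an ASM
```

The second example is an SRM with a row sum of 2, which illustrates how the class is larger than the ASMs: only the partial column sums are confined to 0 or 1.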
-
Convex $(0,1)$-Matrices and Their Epitopes
Authors:
Richard A. Brualdi,
Geir Dahl
Abstract:
We investigate $(0,1)$-matrices that are convex, which means that the ones are consecutive in every row and column. These matrices occur in discrete tomography. The notion of ranked essential sets, known for permutation matrices, is extended to convex sets. We show a number of results for the class $\mathcal{C}(R,S)$ of convex matrices with given row and column sum vectors $R$ and $S$. Also, it is shown that the ranked essential set uniquely determines a matrix in $\mathcal{C}(R,S)$.
Submitted 11 January, 2021;
originally announced January 2021.
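The convexity condition in the abstract above (ones consecutive in every row and every column) can be checked as in the following sketch. The helper names `ones_are_consecutive` and `is_convex` are hypothetical, chosen for illustration.

```python
def ones_are_consecutive(line):
    """True if the ones in a 0/1 sequence form a single contiguous run (or none)."""
    positions = [i for i, entry in enumerate(line) if entry == 1]
    return not positions or positions[-1] - positions[0] + 1 == len(positions)


def is_convex(rows):
    """True if the ones are consecutive in every row and every column."""
    return (all(ones_are_consecutive(r) for r in rows)
            and all(ones_are_consecutive(c) for c in zip(*rows)))


staircase = [[0, 1, 1],
             [1, 1, 0]]
gap = [[1, 0, 1]]
```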
-
Diagonal Sums of Doubly Stochastic Matrices
Authors:
Richard A. Brualdi,
Geir Dahl
Abstract:
Let $Ω_n$ denote the class of $n \times n$ doubly stochastic matrices (each such matrix is entrywise nonnegative and every row and column sum is 1). We study the diagonals of matrices in $Ω_n$. The main question is: which $A \in Ω_n$ are such that the diagonals in $A$ that avoid the zeros of $A$ all have the same sum of their entries. We give a characterization of such matrices, and establish several classes of patterns of such matrices.
Submitted 11 January, 2021;
originally announced January 2021.
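The main question in the abstract above can be explored by brute force for small $n$. The sketch below assumes the usual meaning of a diagonal (the entries $a_{i,σ(i)}$ for a permutation $σ$) and enumerates all of them, so it is only practical for small matrices; the function names are illustrative, not the authors'.

```python
from itertools import permutations


def positive_diagonal_sums(a):
    """Sums of all diagonals (one entry per row and column) that avoid the zeros of a."""
    n = len(a)
    sums = []
    for sigma in permutations(range(n)):
        entries = [a[i][sigma[i]] for i in range(n)]
        if all(e > 0 for e in entries):
            sums.append(sum(entries))
    return sums


def has_constant_diagonal_sums(a, tol=1e-9):
    """True if every zero-avoiding diagonal has the same entry sum (up to tol)."""
    sums = positive_diagonal_sums(a)
    return not sums or max(sums) - min(sums) <= tol


uniform = [[1 / 3] * 3 for _ in range(3)]   # every diagonal sums to 1
mixed = [[0.50, 0.25, 0.25],
         [0.25, 0.50, 0.25],
         [0.25, 0.25, 0.50]]                # diagonal sums differ
```

In the second matrix the main diagonal sums to 1.5 while the diagonal of a 3-cycle sums to 0.75, so it does not have the property.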
-
On Kemeny's constant for trees with fixed order and diameter
Authors:
Lorenzo Ciardo,
Geir Dahl,
Steve Kirkland
Abstract:
Kemeny's constant $κ(G)$ of a connected graph $G$ is a measure of the expected transit time for the random walk associated with $G$. In the current work, we consider the case when $G$ is a tree, and, in this setting, we provide lower and upper bounds for $κ(G)$ in terms of the order $n$ and diameter $δ$ of $G$ by using two different techniques. The lower bound is given as Kemeny's constant of a particular caterpillar tree and, as a consequence, it is sharp. The upper bound is found via induction, by repeatedly removing pendent vertices from $G$. By considering a specific family of trees - the broom-stars - we show that the upper bound is asymptotically sharp.
Submitted 18 March, 2020;
originally announced March 2020.
-
A deep learning based tool for automatic brain extraction from functional magnetic resonance images in rodents
Authors:
Sidney Pontes-Filho,
Annelene Gulden Dahl,
Stefano Nichele,
Gustavo Borges Moreno e Mello
Abstract:
Removing skull artifacts from functional magnetic resonance images (fMRI) is a well understood and frequently encountered problem. Because the fMRI field has grown mostly due to human studies, many new tools were developed to handle human data. Nonetheless, these tools are not equally useful for handling data derived from animal studies, especially from rodents. This represents a major problem for the field because rodent studies generate larger datasets from larger populations, which implies that preprocessing these images manually to remove the skull becomes a bottleneck in the data analysis pipeline. In this study, we address this problem by implementing a neural network based method that uses a U-Net architecture to segment the brain area into a mask and remove the skull and other tissues from the image. We demonstrate several strategies to speed up the process of generating the training dataset using watershedding, and several strategies for data augmentation that allowed the U-Net to be trained faster to perform the segmentation. Finally, we made the trained network freely available.
Submitted 5 December, 2019; v1 submitted 3 December, 2019;
originally announced December 2019.
-
On Empirical Comparisons of Optimizers for Deep Learning
Authors:
Dami Choi,
Christopher J. Shallue,
Zachary Nado,
Jaehoon Lee,
Chris J. Maddison,
George E. Dahl
Abstract:
Selecting an optimizer is a central step in the contemporary deep learning pipeline. In this paper, we demonstrate the sensitivity of optimizer comparisons to the hyperparameter tuning protocol. Our findings suggest that the hyperparameter search space may be the single most important factor explaining the rankings obtained by recent empirical comparisons in the literature. In fact, we show that these results can be contradicted when hyperparameter search spaces are changed. As tuning effort grows without bound, more general optimizers should never underperform the ones they can approximate (i.e., Adam should never perform worse than momentum), but recent attempts to compare optimizers either assume these inclusion relationships are not practically relevant or restrict the hyperparameters in ways that break the inclusions. In our experiments, we find that inclusion relationships between optimizers matter in practice and always predict optimizer comparisons. In particular, we find that the popular adaptive gradient methods never underperform momentum or gradient descent. We also report practical tips around tuning often ignored hyperparameters of adaptive gradient methods and raise concerns about fairly benchmarking optimizers for neural network training.
Submitted 15 June, 2020; v1 submitted 11 October, 2019;
originally announced October 2019.
-
Permutation Matrices, Their Discrete Derivatives and Extremal Properties
Authors:
Richard A. Brualdi,
Geir Dahl
Abstract:
For a permutation $π$, and the corresponding permutation matrix, we introduce the notion of discrete derivative, obtained by taking differences of successive entries in $π$. We characterize the possible derivatives of permutations, and consider questions for permutations with certain properties satisfied by the derivative. For instance, we consider permutations with distinct derivatives, and the relationship to so-called Costas arrays.
Submitted 10 August, 2019;
originally announced August 2019.
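A small sketch of the discrete derivative described in the abstract above. The sign convention $π(i+1) - π(i)$ and the function names are assumptions made for illustration; the abstract only says that differences of successive entries are taken.

```python
def discrete_derivative(perm):
    """Differences of successive entries: perm[i + 1] - perm[i]."""
    return [b - a for a, b in zip(perm, perm[1:])]


def has_distinct_derivative(perm):
    """True if no two successive differences of the permutation coincide."""
    d = discrete_derivative(perm)
    return len(set(d)) == len(d)


d = discrete_derivative([1, 2, 4, 3])   # differences 1, 2, -1
```

Here `[1, 2, 4, 3]` has pairwise distinct differences, whereas `[1, 3, 2, 4]` repeats the difference 2.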
-
Faster Neural Network Training with Data Echoing
Authors:
Dami Choi,
Alexandre Passos,
Christopher J. Shallue,
George E. Dahl
Abstract:
In the twilight of Moore's law, GPUs and other specialized hardware accelerators have dramatically sped up neural network training. However, earlier stages of the training pipeline, such as disk I/O and data preprocessing, do not run on accelerators. As accelerators continue to improve, these earlier stages will increasingly become the bottleneck. In this paper, we introduce "data echoing," which reduces the total computation used by earlier pipeline stages and speeds up training whenever computation upstream from accelerators dominates the training time. Data echoing reuses (or "echoes") intermediate outputs from earlier pipeline stages in order to reclaim idle capacity. We investigate the behavior of different data echoing algorithms on various workloads, for various amounts of echoing, and for various batch sizes. We find that in all settings, at least one data echoing algorithm can match the baseline's predictive performance using less upstream computation. We measured a factor of 3.25 decrease in wall-clock time for ResNet-50 on ImageNet when reading training data over a network.
Submitted 7 May, 2020; v1 submitted 11 July, 2019;
originally announced July 2019.
-
Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model
Authors:
Guodong Zhang,
Lala Li,
Zachary Nado,
James Martens,
Sushant Sachdeva,
George E. Dahl,
Christopher J. Shallue,
Roger Grosse
Abstract:
Increasing the batch size is a popular way to speed up neural network training, but beyond some critical batch size, larger batch sizes yield diminishing returns. In this work, we study how the critical batch size changes based on properties of the optimization algorithm, including acceleration and preconditioning, through two different lenses: large scale experiments, and analysis of a simple noisy quadratic model (NQM). We experimentally demonstrate that optimization algorithms that employ preconditioning, specifically Adam and K-FAC, result in much larger critical batch sizes than stochastic gradient descent with momentum. We also demonstrate that the NQM captures many of the essential features of real neural network training, despite being drastically simpler to work with. The NQM predicts our results with preconditioned optimizers, previous results with accelerated gradient descent, and other results around optimal learning rates and large batch training, making it a useful tool to generate testable predictions about neural network optimization.
Submitted 28 October, 2019; v1 submitted 9 July, 2019;
originally announced July 2019.
-
Measuring the Effects of Data Parallelism on Neural Network Training
Authors:
Christopher J. Shallue,
Jaehoon Lee,
Joseph Antognini,
Jascha Sohl-Dickstein,
Roy Frostig,
George E. Dahl
Abstract:
Recent hardware developments have dramatically increased the scale of data parallelism available for neural network training. Among the simplest ways to harness next-generation hardware is to increase the batch size in standard mini-batch neural network training algorithms. In this work, we aim to experimentally characterize the effects of increasing the batch size on training time, as measured by the number of steps necessary to reach a goal out-of-sample error. We study how this relationship varies with the training algorithm, model, and data set, and find extremely large variation between workloads. Along the way, we show that disagreements in the literature on how batch size affects model quality can largely be explained by differences in metaparameter tuning and compute budgets at different batch sizes. We find no evidence that larger batch sizes degrade out-of-sample performance. Finally, we discuss the implications of our results on efforts to train neural networks much faster in the future. Our experimental data is publicly available as a database of 71,638,836 loss measurements taken over the course of training for 168,160 individual models across 35 workloads.
Submitted 18 July, 2019; v1 submitted 8 November, 2018;
originally announced November 2018.
-
The Importance of Generation Order in Language Modeling
Authors:
Nicolas Ford,
Daniel Duckworth,
Mohammad Norouzi,
George E. Dahl
Abstract:
Neural language models are a critical component of state-of-the-art systems for machine translation, summarization, audio transcription, and other tasks. These language models are almost universally autoregressive in nature, generating sentences one token at a time from left to right. This paper studies the influence of token generation order on model quality via a novel two-pass language model that produces partially-filled sentence "templates" and then fills in missing tokens. We compare various strategies for structuring these two passes and observe a surprisingly large variation in model quality. We find the most effective strategy generates function words in the first pass followed by content words in the second. We believe these experimental results justify a more extensive investigation of generation order for neural language models.
Submitted 23 August, 2018;
originally announced August 2018.
-
Peptide-Spectra Matching from Weak Supervision
Authors:
Samuel S. Schoenholz,
Sean Hackett,
Laura Deming,
Eugene Melamud,
Navdeep Jaitly,
Fiona McAllister,
Jonathon O'Brien,
George Dahl,
Bryson Bennett,
Andrew M. Dai,
Daphne Koller
Abstract:
As in many other scientific domains, we face a fundamental problem when using machine learning to identify proteins from mass spectrometry data: large ground truth datasets mapping inputs to correct outputs are extremely difficult to obtain. Instead, we have access to imperfect hand-coded models crafted by domain experts. In this paper, we apply deep neural networks to an important step of the protein identification problem, the pairing of mass spectra with short sequences of amino acids called peptides. We train our model to differentiate between top scoring results from a state-of-the-art classical system and hard-negative second and third place results. Our resulting model is much better at identifying peptides with spectra than the model used to generate its training data. In particular, we achieve a 43% improvement over standard matching methods and a 10% improvement over a combination of the matching method and an industry standard cross-spectra reranking tool. Importantly, in a more difficult experimental regime that reflects current challenges facing biologists, our advantage over the previous state-of-the-art grows to 15% even after reranking. We believe this approach will generalize to other challenging scientific problems.
Submitted 22 August, 2018; v1 submitted 20 August, 2018;
originally announced August 2018.
-
Motivating the Rules of the Game for Adversarial Example Research
Authors:
Justin Gilmer,
Ryan P. Adams,
Ian Goodfellow,
David Andersen,
George E. Dahl
Abstract:
Advances in machine learning have led to broad deployment of systems with impressive performance on important problems. Nonetheless, these systems can be induced to make errors on data that are surprisingly similar to examples the learned system handles correctly. The existence of these errors raises a variety of questions about out-of-sample generalization and whether bad actors might use such examples to abuse deployed systems. As a result of these security concerns, there has been a flurry of recent papers proposing algorithms to defend against such malicious perturbations of correctly handled examples. It is unclear how such misclassifications represent a different kind of security problem than other errors, or even other attacker-produced examples that have no specific relationship to an uncorrupted input. In this paper, we argue that adversarial example defense papers have, to date, mostly considered abstract, toy games that do not relate to any specific security concern. Furthermore, defense papers have not yet precisely described all the abilities and limitations of attackers that would be relevant in practical security. Towards this end, we establish a taxonomy of motivations, constraints, and abilities for more plausible adversaries. Finally, we provide a series of recommendations outlining a path forward for future work to more clearly articulate the threat model and perform more meaningful evaluation.
Submitted 19 July, 2018; v1 submitted 17 July, 2018;
originally announced July 2018.
-
Embedding Text in Hyperbolic Spaces
Authors:
Bhuwan Dhingra,
Christopher J. Shallue,
Mohammad Norouzi,
Andrew M. Dai,
George E. Dahl
Abstract:
Natural language text exhibits hierarchical structure in a variety of respects. Ideally, we could incorporate our prior knowledge of this hierarchical structure into unsupervised learning algorithms that work on text data. Recent work by Nickel & Kiela (2017) proposed using hyperbolic instead of Euclidean embedding spaces to represent hierarchical data and demonstrated encouraging results when embedding graphs. In this work, we extend their method with a re-parameterization technique that allows us to learn hyperbolic embeddings of arbitrarily parameterized objects. We apply this framework to learn word and sentence embeddings in hyperbolic space in an unsupervised manner from text corpora. The resulting embeddings seem to encode certain intuitive notions of hierarchy, such as word-context frequency and phrase constituency. However, the implicit continuous hierarchy in the learned hyperbolic space makes interrogating the model's learned hierarchies more difficult than for models that learn explicit edges between items. The learned hyperbolic embeddings show improvements over Euclidean embeddings in some -- but not all -- downstream tasks, suggesting that hierarchical organization is more useful for some tasks than others.
Submitted 11 June, 2018;
originally announced June 2018.
-
Relational inductive biases, deep learning, and graph networks
Authors:
Peter W. Battaglia,
Jessica B. Hamrick,
Victor Bapst,
Alvaro Sanchez-Gonzalez,
Vinicius Zambaldi,
Mateusz Malinowski,
Andrea Tacchetti,
David Raposo,
Adam Santoro,
Ryan Faulkner,
Caglar Gulcehre,
Francis Song,
Andrew Ballard,
Justin Gilmer,
George Dahl,
Ashish Vaswani,
Kelsey Allen,
Charles Nash,
Victoria Langston,
Chris Dyer,
Nicolas Heess,
Daan Wierstra,
Pushmeet Kohli,
Matt Botvinick,
Oriol Vinyals
, et al. (2 additional authors not shown)
Abstract:
Artificial intelligence (AI) has undergone a renaissance recently, making major progress in key domains such as vision, language, control, and decision-making. This has been due, in part, to cheap data and cheap compute resources, which have fit the natural strengths of deep learning. However, many defining characteristics of human intelligence, which developed under much different pressures, remain out of reach for current approaches. In particular, generalizing beyond one's experiences--a hallmark of human intelligence from infancy--remains a formidable challenge for modern AI.
The following is part position paper, part review, and part unification. We argue that combinatorial generalization must be a top priority for AI to achieve human-like abilities, and that structured representations and computations are key to realizing this objective. Just as biology uses nature and nurture cooperatively, we reject the false choice between "hand-engineering" and "end-to-end" learning, and instead advocate for an approach which benefits from their complementary strengths. We explore how using relational inductive biases within deep learning architectures can facilitate learning about entities, relations, and rules for composing them. We present a new building block for the AI toolkit with a strong relational inductive bias--the graph network--which generalizes and extends various approaches for neural networks that operate on graphs, and provides a straightforward interface for manipulating structured knowledge and producing structured behaviors. We discuss how graph networks can support relational reasoning and combinatorial generalization, laying the foundation for more sophisticated, interpretable, and flexible patterns of reasoning. As a companion to this paper, we have released an open-source software library for building graph networks, with demonstrations of how to use them in practice.
Submitted 17 October, 2018; v1 submitted 4 June, 2018;
originally announced June 2018.
-
New Bounds for the Signless Laplacian Spread
Authors:
Enide Andrade,
Geir Dahl,
Laura Leal,
María Robbiano
Abstract:
Let $G$ be a simple graph. The signless Laplacian spread of $G$ is defined as the maximum distance of pairs of its signless Laplacian eigenvalues. This paper establishes some new bounds, both lower and upper, for the signless Laplacian spread. Several of these bounds depend on invariant parameters of the graph. We also use a minmax principle to find several lower bounds for this spectral invariant.
Submitted 30 May, 2018;
originally announced May 2018.
-
Parallel Architecture and Hyperparameter Search via Successive Halving and Classification
Authors:
Manoj Kumar,
George E. Dahl,
Vijay Vasudevan,
Mohammad Norouzi
Abstract:
We present a simple and powerful algorithm for parallel black box optimization called Successive Halving and Classification (SHAC). The algorithm operates in $K$ stages of parallel function evaluations and trains a cascade of binary classifiers to iteratively cull the undesirable regions of the search space. SHAC is easy to implement, requires no tuning of its own configuration parameters, is invariant to the scale of the objective function and can be built using any choice of binary classifier. We adopt tree-based classifiers within SHAC and achieve competitive performance against several strong baselines for optimizing synthetic functions, hyperparameters and architectures.
Submitted 25 May, 2018;
originally announced May 2018.
-
Large scale distributed neural network training through online distillation
Authors:
Rohan Anil,
Gabriel Pereyra,
Alexandre Passos,
Robert Ormandi,
George E. Dahl,
Geoffrey E. Hinton
Abstract:
Techniques such as ensembling and distillation promise model quality improvements when paired with almost any base model. However, due to increased test-time cost (for ensembles) and increased complexity of the training pipeline (for distillation), these techniques are challenging to use in industrial settings. In this paper we explore a variant of distillation which is relatively straightforward to use as it does not require a complicated multi-stage setup or many new hyperparameters. Our first claim is that online distillation enables us to use extra parallelism to fit very large datasets about twice as fast. Crucially, we can still speed up training even after we have already reached the point at which additional parallelism provides no benefit for synchronous or asynchronous stochastic gradient descent. Two neural networks trained on disjoint subsets of the data can share knowledge by encouraging each model to agree with the predictions the other model would have made. These predictions can come from a stale version of the other model so they can be safely computed using weights that only rarely get transmitted. Our second claim is that online distillation is a cost-effective way to make the exact predictions of a model dramatically more reproducible. We support our claims using experiments on the Criteo Display Ad Challenge dataset, ImageNet, and the largest to-date dataset used for neural language modeling, containing $6\times 10^{11}$ tokens and based on the Common Crawl repository of web data.
Submitted 20 August, 2020; v1 submitted 9 April, 2018;
originally announced April 2018.
-
Alternating Sign Matrices and Hypermatrices, and a Generalization of Latin Square
Authors:
Richard A. Brualdi,
Geir Dahl
Abstract:
An alternating sign matrix, or ASM, is a $(0, \pm 1)$-matrix where the nonzero entries in each row and column alternate in sign. We generalize this notion to hypermatrices: an $n\times n\times n$ hypermatrix $A=[a_{ijk}]$ is an {\em alternating sign hypermatrix}, or ASHM, if each of its planes, obtained by fixing one of the three indices, is an ASM. Several results concerning ASHMs are shown, such as finding the maximum number of nonzeros of an $n\times n\times n$ ASHM, and properties related to Latin squares. Moreover, we investigate completion problems, in which one asks if a subhypermatrix can be completed (extended) into an ASHM. We show several theorems of this type.
Submitted 25 April, 2017;
originally announced April 2017.
-
Neural Message Passing for Quantum Chemistry
Authors:
Justin Gilmer,
Samuel S. Schoenholz,
Patrick F. Riley,
Oriol Vinyals,
George E. Dahl
Abstract:
Supervised learning on molecules has incredible potential to be useful in chemistry, drug discovery, and materials science. Luckily, several promising and closely related neural network models invariant to molecular symmetries have already been described in the literature. These models learn a message passing algorithm and aggregation procedure to compute a function of their entire input graph. At this point, the next step is to find a particularly effective variant of this general approach and apply it to chemical prediction benchmarks until we either solve them or reach the limits of the approach. In this paper, we reformulate existing models into a single common framework we call Message Passing Neural Networks (MPNNs) and explore additional novel variations within this framework. Using MPNNs we demonstrate state of the art results on an important molecular property prediction benchmark; these results are strong enough that we believe future work should focus on datasets with larger molecules or more accurate ground truth labels.
Submitted 12 June, 2017; v1 submitted 4 April, 2017;
originally announced April 2017.
-
Detecting Cancer Metastases on Gigapixel Pathology Images
Authors:
Yun Liu,
Krishna Gadepalli,
Mohammad Norouzi,
George E. Dahl,
Timo Kohlberger,
Aleksey Boyko,
Subhashini Venugopalan,
Aleksei Timofeev,
Philip Q. Nelson,
Greg S. Corrado,
Jason D. Hipp,
Lily Peng,
Martin C. Stumpe
Abstract:
Each year, the treatment decisions for more than 230,000 breast cancer patients in the U.S. hinge on whether the cancer has metastasized away from the breast. Metastasis detection is currently performed by pathologists reviewing large expanses of biological tissues. This process is labor intensive and error-prone. We present a framework to automatically detect and localize tumors as small as 100 x 100 pixels in gigapixel microscopy images sized 100,000 x 100,000 pixels. Our method leverages a convolutional neural network (CNN) architecture and obtains state-of-the-art results on the Camelyon16 dataset in the challenging lesion-level tumor detection task. At 8 false positives per image, we detect 92.4% of the tumors, relative to 82.7% by the previous best automated approach. For comparison, a human pathologist attempting exhaustive search achieved 73.2% sensitivity. We achieve image-level AUC scores above 97% on both the Camelyon16 test set and an independent set of 110 slides. In addition, we discover that two slides in the Camelyon16 training set were erroneously labeled normal. Our approach could considerably reduce false negative rates in metastasis detection.
Submitted 7 March, 2017; v1 submitted 3 March, 2017;
originally announced March 2017.
-
Machine learning prediction errors better than DFT accuracy
Authors:
Felix A. Faber,
Luke Hutchison,
Bing Huang,
Justin Gilmer,
Samuel S. Schoenholz,
George E. Dahl,
Oriol Vinyals,
Steven Kearnes,
Patrick F. Riley,
O. Anatole von Lilienfeld
Abstract:
We investigate the impact of choosing regressors and molecular representations for the construction of fast machine learning (ML) models of thirteen electronic ground-state properties of organic molecules. The performance of each regressor/representation/property combination is assessed using learning curves which report out-of-sample errors as a function of training set size with up to $\sim$117k distinct molecules. Molecular structures and properties at hybrid density functional theory (DFT) level of theory used for training and testing come from the QM9 database [Ramakrishnan et al, {\em Scientific Data} {\bf 1} 140022 (2014)] and include dipole moment, polarizability, HOMO/LUMO energies and gap, electronic spatial extent, zero point vibrational energy, enthalpies and free energies of atomization, heat capacity and the highest fundamental vibrational frequency. Various representations from the literature have been studied (Coulomb matrix, bag of bonds, BAML and ECFP4, molecular graphs (MG)), as well as newly developed distribution based variants including histograms of distances (HD), and angles (HDA/MARAD), and dihedrals (HDAD). Regressors include linear models (Bayesian ridge regression (BR) and linear regression with elastic net regularization (EN)), random forest (RF), kernel ridge regression (KRR) and two types of neural networks, graph convolutions (GC) and gated graph networks (GG). We present numerical evidence that ML model predictions deviate from DFT less than DFT deviates from experiment for all properties. Furthermore, our out-of-sample prediction errors with respect to hybrid DFT reference are on par with, or close to, chemical accuracy. Our findings suggest that ML models could be more accurate than hybrid DFT if explicitly electron correlated quantum (or experimental) data was available.
Submitted 4 June, 2017; v1 submitted 17 February, 2017;
originally announced February 2017.
-
Measurement of the Yb I $^1S_0 - ^1P_1$ transition frequency at 399 nm using an optical frequency comb
Authors:
Michaela Kleinert,
M. E. Gold Dahl,
Scott D. Bergeson
Abstract:
We determine the frequency of the Yb I $^1S_0 - ^1P_1$ transition at 399 nm using an optical frequency comb. Although this transition was measured previously using an optical transfer cavity [D. Das et al., Phys. Rev. A 72, 032506 (2005)], recent work has uncovered significant errors in that method. We compare our result of 751 526 533.49 $\pm$ 0.33 MHz for the Yb-174 isotope with those from the literature and discuss observed differences. We verify the correctness of our method by measuring the frequencies of well-known transitions in Rb and Cs, and by demonstrating proper control of systematic errors in both laser metrology and atomic spectroscopy. We also demonstrate the effect of quantum interference due to hyperfine structure in a divalent atomic system and present isotope shift measurements for all stable isotopes.
Submitted 2 December, 2016; v1 submitted 7 September, 2016;
originally announced September 2016.
-
Incorporating Side Information in Probabilistic Matrix Factorization with Gaussian Processes
Authors:
Ryan Prescott Adams,
George E. Dahl,
Iain Murray
Abstract:
Probabilistic matrix factorization (PMF) is a powerful method for modeling data associated with pairwise relationships, finding use in collaborative filtering, computational biology, and document analysis, among other areas. In many domains, there are additional covariates that can assist in prediction. For example, when modeling movie ratings, we might know when the rating occurred, where the user lives, or what actors appear in the movie. It is difficult, however, to incorporate this side information into the PMF model. We propose a framework for incorporating side information by coupling together multiple PMF problems via Gaussian process priors. We replace scalar latent features with functions that vary over the covariate space. The GP priors on these functions require them to vary smoothly and share information. We apply this new method to predict the scores of professional basketball games, where side information about the venue and date of the game are relevant for the outcome.
Submitted 9 August, 2014;
originally announced August 2014.
-
Multi-task Neural Networks for QSAR Predictions
Authors:
George E. Dahl,
Navdeep Jaitly,
Ruslan Salakhutdinov
Abstract:
Although artificial neural networks have occasionally been used for Quantitative Structure-Activity/Property Relationship (QSAR/QSPR) studies in the past, the literature has of late been dominated by other machine learning techniques such as random forests. However, a variety of new neural net techniques along with successful applications in other domains have renewed interest in network approaches. In this work, inspired by the winning team's use of neural networks in a recent QSAR competition, we used an artificial neural network to learn a function that predicts activities of compounds for multiple assays at the same time. We conducted experiments leveraging recent methods for dealing with overfitting in neural networks as well as other tricks from the neural networks literature. We compared our methods to alternative methods reported to perform well on these tasks and found that our neural net methods provided superior performance.
Submitted 4 June, 2014;
originally announced June 2014.
-
Improvements to deep convolutional neural networks for LVCSR
Authors:
Tara N. Sainath,
Brian Kingsbury,
Abdel-rahman Mohamed,
George E. Dahl,
George Saon,
Hagen Soltau,
Tomas Beran,
Aleksandr Y. Aravkin,
Bhuvana Ramabhadran
Abstract:
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Neural Networks (DNN), as they are able to better reduce spectral variation in the input signal. This has also been confirmed experimentally, with CNNs showing improvements in word error rate (WER) between 4-12% relative compared to DNNs across a variety of LVCSR tasks. In this paper, we describe different methods to further improve CNN performance. First, we conduct a deep analysis comparing limited weight sharing and full weight sharing with state-of-the-art features. Second, we apply various pooling strategies that have shown improvements in computer vision to an LVCSR speech task. Third, we introduce a method to effectively incorporate speaker adaptation, namely fMLLR, into log-mel features. Fourth, we introduce an effective strategy to use dropout during Hessian-free sequence training. We find that with these improvements, particularly with fMLLR and dropout, we are able to achieve an additional 2-3% relative improvement in WER on a 50-hour Broadcast News task over our previous best CNN baseline. On a larger 400-hour BN task, we find an additional 4-5% relative improvement over our previous best CNN baseline.
Submitted 10 December, 2013; v1 submitted 5 September, 2013;
originally announced September 2013.
-
Subdivision schemes, network flows and linear optimization
Authors:
Maria Charina,
Geir Dahl
Abstract:
We link regularity and smoothness analysis of multivariate vector subdivision schemes with network flow theory and with special linear optimization problems. This connection allows us to prove the existence of what we call optimal difference masks that possess crucial properties unifying the regularity analysis of univariate and multivariate subdivision schemes. We also provide efficient optimization algorithms for construction of such optimal masks. Integrality of the corresponding optimal values leads to purely analytic proofs of $C^k$-regularity of subdivision.
Submitted 10 February, 2015; v1 submitted 4 June, 2013;
originally announced June 2013.
-
Training Restricted Boltzmann Machines on Word Observations
Authors:
George E. Dahl,
Ryan P. Adams,
Hugo Larochelle
Abstract:
The restricted Boltzmann machine (RBM) is a flexible tool for modeling complex data; however, there have been significant computational difficulties in using RBMs to model high-dimensional multinomial observations. In natural language processing applications, words are naturally modeled by K-ary discrete distributions, where K is determined by the vocabulary size and can easily be in the hundreds of thousands. The conventional approach to training RBMs on word observations is limited because it requires sampling the states of K-way softmax visible units during block Gibbs updates, an operation that takes time linear in K. In this work, we address this issue by employing a more general class of Markov chain Monte Carlo operators on the visible units, yielding updates with computational complexity independent of K. We demonstrate the success of our approach by training RBMs on hundreds of millions of word n-grams using larger vocabularies than previously feasible and using the learned features to improve performance on chunking and sentiment classification tasks, achieving state-of-the-art results on the latter.
Submitted 5 July, 2012; v1 submitted 25 February, 2012;
originally announced February 2012.
-
Quantum Strategies
Authors:
Gordon B. Dahl,
Steven E. Landsburg
Abstract:
We investigate the consequences of allowing players to adopt strategies which take advantage of quantum randomization devices. In games of full information, the resulting equilibria are always correlated equilibria, but not all correlated equilibria appear as quantum equilibria. The classical and quantum theories diverge further in games of private information. In the quantum context, we show that Kuhn's equivalence between behavioral and mixed strategies breaks down. As a result, quantum technology allows players to achieve outcomes that would not be achievable with any classical technology short of direct communication; in particular they do not occur as correlated equilibria.
In general, in games of private information, quantum technology allows players to achieve outcomes that are Pareto superior to any classical correlated equilibrium, but not necessarily Pareto optimal. A simple economic example illustrates these points.
Submitted 20 October, 2011;
originally announced October 2011.
-
Incorporating Side Information in Probabilistic Matrix Factorization with Gaussian Processes
Authors:
Ryan Prescott Adams,
George E. Dahl,
Iain Murray
Abstract:
Probabilistic matrix factorization (PMF) is a powerful method for modeling data associated with pairwise relationships, finding use in collaborative filtering, computational biology, and document analysis, among other areas. In many domains, there is additional information that can assist in prediction. For example, when modeling movie ratings, we might know when the rating occurred, where the user lives, or what actors appear in the movie. It is difficult, however, to incorporate this side information into the PMF model. We propose a framework for incorporating side information by coupling together multiple PMF problems via Gaussian process priors. We replace scalar latent features with functions that vary over the space of side information. The GP priors on these functions require them to vary smoothly and share information. We successfully use this new method to predict the scores of professional basketball games, where side information about the venue and date of the game is relevant for the outcome.
Submitted 25 March, 2010;
originally announced March 2010.