-
Universal characteristics of deep neural network loss surfaces from random matrix theory
Authors:
Nicholas P Baskerville,
Jonathan P Keating,
Francesco Mezzadri,
Joseph Najnudel,
Diego Granziol
Abstract:
This paper considers several aspects of random matrix universality in deep neural networks. Motivated by recent experimental work, we use universal properties of random matrices related to local statistics to derive practical implications for deep neural networks based on a realistic model of their Hessians. In particular we derive universal aspects of outliers in the spectra of deep neural networ…
▽ More
This paper considers several aspects of random matrix universality in deep neural networks. Motivated by recent experimental work, we use universal properties of random matrices related to local statistics to derive practical implications for deep neural networks based on a realistic model of their Hessians. In particular we derive universal aspects of outliers in the spectra of deep neural networks and demonstrate the important role of random matrix local laws in popular pre-conditioning gradient descent algorithms. We also present insights into deep neural network loss surfaces from quite general arguments based on tools from statistical physics and random matrix theory.
△ Less
Submitted 20 June, 2022; v1 submitted 17 May, 2022;
originally announced May 2022.
-
Ranker-agnostic Contextual Position Bias Estimation
Authors:
Oriol Barbany Mayor,
Vito Bellini,
Alexander Buchholz,
Giuseppe Di Benedetto,
Diego Marco Granziol,
Matteo Ruffini,
Yannik Stein
Abstract:
Learning-to-rank (LTR) algorithms are ubiquitous and necessary to explore the extensive catalogs of media providers. To avoid the user examining all the results, its preferences are used to provide a subset of relatively small size. The user preferences can be inferred from the interactions with the presented content if explicit ratings are unavailable. However, directly using implicit feedback ca…
▽ More
Learning-to-rank (LTR) algorithms are ubiquitous and necessary to explore the extensive catalogs of media providers. To avoid the user examining all the results, its preferences are used to provide a subset of relatively small size. The user preferences can be inferred from the interactions with the presented content if explicit ratings are unavailable. However, directly using implicit feedback can lead to learning wrong relevance models and is known as biased LTR. The mismatch between implicit feedback and true relevances is due to various nuisances, with position bias one of the most relevant. Position bias models consider that the lack of interaction with a presented item is not only attributed to the item being irrelevant but because the item was not examined. This paper introduces a method for modeling the probability of an item being seen in different contexts, e.g., for different users, with a single estimator. Our suggested method, denoted as contextual (EM)-based regression, is ranker-agnostic and able to correctly learn the latent examination probabilities while only using implicit feedback. Our empirical results indicate that the method introduced in this paper outperforms other existing position bias estimators in terms of relative error when the examination probability varies across queries. Moreover, the estimated values provide a ranking performance boost when used to debias the implicit ranking data even if there is no context dependency on the examination probabilities.
△ Less
Submitted 28 July, 2021;
originally announced July 2021.
-
Appearance of Random Matrix Theory in Deep Learning
Authors:
Nicholas P Baskerville,
Diego Granziol,
Jonathan P Keating
Abstract:
We investigate the local spectral statistics of the loss surface Hessians of artificial neural networks, where we discover excellent agreement with Gaussian Orthogonal Ensemble statistics across several network architectures and datasets. These results shed new light on the applicability of Random Matrix Theory to modelling neural networks and suggest a previously unrecognised role for it in the s…
▽ More
We investigate the local spectral statistics of the loss surface Hessians of artificial neural networks, where we discover excellent agreement with Gaussian Orthogonal Ensemble statistics across several network architectures and datasets. These results shed new light on the applicability of Random Matrix Theory to modelling neural networks and suggest a previously unrecognised role for it in the study of loss surfaces in deep learning. Inspired by these observations, we propose a novel model for the true loss surfaces of neural networks, consistent with our observations, which allows for Hessian spectral densities with rank degeneracy and outliers, extensively observed in practice, and predicts a growing independence of loss gradients as a function of distance in weight-space. We further investigate the importance of the true loss surface in neural networks and find, in contrast to previous work, that the exponential hardness of locating the global minimum has practical consequences for achieving state of the art performance.
△ Less
Submitted 24 December, 2021; v1 submitted 12 February, 2021;
originally announced February 2021.
-
A Random Matrix Theory Approach to Dam** in Deep Learning
Authors:
Diego Granziol,
Nicholas Baskerville
Abstract:
We conjecture that the inherent difference in generalisation between adaptive and non-adaptive gradient methods in deep learning stems from the increased estimation noise in the flattest directions of the true loss surface. We demonstrate that typical schedules used for adaptive methods (with low numerical stability or dam** constants) serve to bias relative movement towards flat directions rela…
▽ More
We conjecture that the inherent difference in generalisation between adaptive and non-adaptive gradient methods in deep learning stems from the increased estimation noise in the flattest directions of the true loss surface. We demonstrate that typical schedules used for adaptive methods (with low numerical stability or dam** constants) serve to bias relative movement towards flat directions relative to sharp directions, effectively amplifying the noise-to-signal ratio and harming generalisation. We further demonstrate that the numerical dam** constant used in these methods can be decomposed into a learning rate reduction and linear shrinkage of the estimated curvature matrix. We then demonstrate significant generalisation improvements by increasing the shrinkage coefficient, closing the generalisation gap entirely in both logistic regression and several deep neural network experiments. Extending this line further, we develop a novel random matrix theory based dam** learner for second order optimiser inspired by linear shrinkage estimation. We experimentally demonstrate our learner to be very insensitive to the initialised value and to allow for extremely fast convergence in conjunction with continued stable training and competitive generalisation.
△ Less
Submitted 16 March, 2022; v1 submitted 15 November, 2020;
originally announced November 2020.
-
Learning Rates as a Function of Batch Size: A Random Matrix Theory Approach to Neural Network Training
Authors:
Diego Granziol,
Stefan Zohren,
Stephen Roberts
Abstract:
We study the effect of mini-batching on the loss landscape of deep neural networks using spiked, field-dependent random matrix theory. We demonstrate that the magnitude of the extremal values of the batch Hessian are larger than those of the empirical Hessian. We also derive similar results for the Generalised Gauss-Newton matrix approximation of the Hessian. As a consequence of our theorems we de…
▽ More
We study the effect of mini-batching on the loss landscape of deep neural networks using spiked, field-dependent random matrix theory. We demonstrate that the magnitude of the extremal values of the batch Hessian are larger than those of the empirical Hessian. We also derive similar results for the Generalised Gauss-Newton matrix approximation of the Hessian. As a consequence of our theorems we derive an analytical expressions for the maximal learning rates as a function of batch size, informing practical training regimens for both stochastic gradient descent (linear scaling) and adaptive algorithms, such as Adam (square root scaling), for smooth, non-convex deep neural networks. Whilst the linear scaling for stochastic gradient descent has been derived under more restrictive conditions, which we generalise, the square root scaling rule for adaptive optimisers is, to our knowledge, completely novel. %For stochastic second-order methods and adaptive methods, we derive that the minimal dam** coefficient is proportional to the ratio of the learning rate to batch size. We validate our claims on the VGG/WideResNet architectures on the CIFAR-$100$ and ImageNet datasets. Based on our investigations of the sub-sampled Hessian we develop a stochastic Lanczos quadrature based on the fly learning rate and momentum learner, which avoids the need for expensive multiple evaluations for these key hyper-parameters and shows good preliminary results on the Pre-Residual Architecure for CIFAR-$100$.
△ Less
Submitted 5 November, 2021; v1 submitted 16 June, 2020;
originally announced June 2020.
-
Flatness is a False Friend
Authors:
Diego Granziol
Abstract:
Hessian based measures of flatness, such as the trace, Frobenius and spectral norms, have been argued, used and shown to relate to generalisation. In this paper we demonstrate that for feed forward neural networks under the cross entropy loss, we would expect low loss solutions with large weights to have small Hessian based measures of flatness. This implies that solutions obtained using $L2$ regu…
▽ More
Hessian based measures of flatness, such as the trace, Frobenius and spectral norms, have been argued, used and shown to relate to generalisation. In this paper we demonstrate that for feed forward neural networks under the cross entropy loss, we would expect low loss solutions with large weights to have small Hessian based measures of flatness. This implies that solutions obtained using $L2$ regularisation should in principle be sharper than those without, despite generalising better. We show this to be true for logistic regression, multi-layer perceptrons, simple convolutional, pre-activated and wide residual networks on the MNIST and CIFAR-$100$ datasets. Furthermore, we show that for adaptive optimisation algorithms using iterate averaging, on the VGG-$16$ network and CIFAR-$100$ dataset, achieve superior generalisation to SGD but are $30 \times$ sharper. This theoretical finding, along with experimental results, raises serious questions about the validity of Hessian based sharpness measures in the discussion of generalisation. We further show that the Hessian rank can be bounded by the a constant times number of neurons multiplied by the number of classes, which in practice is often a small fraction of the network parameters. This explains the curious observation that many Hessian eigenvalues are either zero or very near zero which has been reported in the literature.
△ Less
Submitted 16 June, 2020;
originally announced June 2020.
-
Beyond Random Matrix Theory for Deep Networks
Authors:
Diego Granziol
Abstract:
We investigate whether the Wigner semi-circle and Marcenko-Pastur distributions, often used for deep neural network theoretical analysis, match empirically observed spectral densities. We find that even allowing for outliers, the observed spectral shapes strongly deviate from such theoretical predictions. This raises major questions about the usefulness of these models in deep learning. We further…
▽ More
We investigate whether the Wigner semi-circle and Marcenko-Pastur distributions, often used for deep neural network theoretical analysis, match empirically observed spectral densities. We find that even allowing for outliers, the observed spectral shapes strongly deviate from such theoretical predictions. This raises major questions about the usefulness of these models in deep learning. We further show that theoretical results, such as the layered nature of critical points, are strongly dependent on the use of the exact form of these limiting spectral densities. We consider two new classes of matrix ensembles; random Wigner/Wishart ensemble products and percolated Wigner/Wishart ensembles, both of which better match observed spectra. They also give large discrete spectral peaks at the origin, providing a theoretical explanation for the observation that various optima can be connected by one dimensional of low loss values. We further show that, in the case of a random matrix product, the weight of the discrete spectral component at $0$ depends on the ratio of the dimensions of the weight matrices.
△ Less
Submitted 3 November, 2021; v1 submitted 13 June, 2020;
originally announced June 2020.
-
Iterative Averaging in the Quest for Best Test Error
Authors:
Diego Granziol,
Xingchen Wan,
Samuel Albanie,
Stephen Roberts
Abstract:
We analyse and explain the increased generalisation performance of iterate averaging using a Gaussian process perturbation model between the true and batch risk surface on the high dimensional quadratic. We derive three phenomena \latestEdits{from our theoretical results:} (1) The importance of combining iterate averaging (IA) with large learning rates and regularisation for improved regularisatio…
▽ More
We analyse and explain the increased generalisation performance of iterate averaging using a Gaussian process perturbation model between the true and batch risk surface on the high dimensional quadratic. We derive three phenomena \latestEdits{from our theoretical results:} (1) The importance of combining iterate averaging (IA) with large learning rates and regularisation for improved regularisation. (2) Justification for less frequent averaging. (3) That we expect adaptive gradient methods to work equally well, or better, with iterate averaging than their non-adaptive counterparts. Inspired by these results\latestEdits{, together with} empirical investigations of the importance of appropriate regularisation for the solution diversity of the iterates, we propose two adaptive algorithms with iterate averaging. These give significantly better results compared to stochastic gradient descent (SGD), require less tuning and do not require early stop** or validation set monitoring. We showcase the efficacy of our approach on the CIFAR-10/100, ImageNet and Penn Treebank datasets on a variety of modern and classical network architectures.
△ Less
Submitted 31 October, 2021; v1 submitted 2 March, 2020;
originally announced March 2020.
-
Deep Curvature Suite
Authors:
Diego Granziol,
Xingchen Wan,
Timur Garipov
Abstract:
We present MLRG Deep Curvature suite, a PyTorch-based, open-source package for analysis and visualisation of neural network curvature and loss landscape. Despite of providing rich information into properties of neural network and useful for a various designed tasks, curvature information is still not made sufficient use for various reasons, and our method aims to bridge this gap. We present a prim…
▽ More
We present MLRG Deep Curvature suite, a PyTorch-based, open-source package for analysis and visualisation of neural network curvature and loss landscape. Despite of providing rich information into properties of neural network and useful for a various designed tasks, curvature information is still not made sufficient use for various reasons, and our method aims to bridge this gap. We present a primer, including its main practical desiderata and common misconceptions, of \textit{Lanczos algorithm}, the theoretical backbone of our package, and present a series of examples based on synthetic toy examples and realistic modern neural networks tested on CIFAR datasets, and show the superiority of our package against existing competing approaches for the similar purposes.
△ Less
Submitted 22 May, 2020; v1 submitted 20 December, 2019;
originally announced December 2019.
-
A Maximum Entropy approach to Massive Graph Spectra
Authors:
Diego Granziol,
Robin Ru,
Stefan Zohren,
Xiaowen Dong,
Michael Osborne,
Stephen Roberts
Abstract:
Graph spectral techniques for measuring graph similarity, or for learning the cluster number, require kernel smoothing. The choice of kernel function and bandwidth are typically chosen in an ad-hoc manner and heavily affect the resulting output. We prove that kernel smoothing biases the moments of the spectral density. We propose an information theoretically optimal approach to learn a smooth grap…
▽ More
Graph spectral techniques for measuring graph similarity, or for learning the cluster number, require kernel smoothing. The choice of kernel function and bandwidth are typically chosen in an ad-hoc manner and heavily affect the resulting output. We prove that kernel smoothing biases the moments of the spectral density. We propose an information theoretically optimal approach to learn a smooth graph spectral density, which fully respects the moment information. Our method's computational cost is linear in the number of edges, and hence can be applied to large networks, with millions of nodes. We apply our method to the problems to graph similarity and cluster number learning, where we outperform comparable iterative spectral approaches on synthetic and real graphs.
△ Less
Submitted 19 December, 2019;
originally announced December 2019.
-
MEMe: An Accurate Maximum Entropy Method for Efficient Approximations in Large-Scale Machine Learning
Authors:
Diego Granziol,
Binxin Ru,
Stefan Zohren,
Xiaowen Doing,
Michael Osborne,
Stephen Roberts
Abstract:
Efficient approximation lies at the heart of large-scale machine learning problems. In this paper, we propose a novel, robust maximum entropy algorithm, which is capable of dealing with hundreds of moments and allows for computationally efficient approximations. We showcase the usefulness of the proposed method, its equivalence to constrained Bayesian variational inference and demonstrate its supe…
▽ More
Efficient approximation lies at the heart of large-scale machine learning problems. In this paper, we propose a novel, robust maximum entropy algorithm, which is capable of dealing with hundreds of moments and allows for computationally efficient approximations. We showcase the usefulness of the proposed method, its equivalence to constrained Bayesian variational inference and demonstrate its superiority over existing approaches in two applications, namely, fast log determinant estimation and information-theoretic Bayesian optimisation.
△ Less
Submitted 3 June, 2019;
originally announced June 2019.
-
Entropic Spectral Learning for Large-Scale Graphs
Authors:
Diego Granziol,
Binxin Ru,
Stefan Zohren,
Xiaowen Dong,
Michael Osborne,
Stephen Roberts
Abstract:
Graph spectra have been successfully used to classify network types, compute the similarity between graphs, and determine the number of communities in a network. For large graphs, where an eigen-decomposition is infeasible, iterative moment matched approximations to the spectra and kernel smoothing are typically used. We show that the underlying moment information is lost when using kernel smoothi…
▽ More
Graph spectra have been successfully used to classify network types, compute the similarity between graphs, and determine the number of communities in a network. For large graphs, where an eigen-decomposition is infeasible, iterative moment matched approximations to the spectra and kernel smoothing are typically used. We show that the underlying moment information is lost when using kernel smoothing. We further propose a spectral density approximation based on the method of Maximum Entropy, for which we develop a new algorithm. This method matches moments exactly and is everywhere positive. We demonstrate its effectiveness and superiority over existing approaches in learning graph spectra, via experiments on both synthetic networks, such as the Erdős-Rényi and Barabási-Albert random graphs, and real-world networks, such as the social networks for Orkut, YouTube, and Amazon from the SNAP dataset.
△ Less
Submitted 25 March, 2019; v1 submitted 18 April, 2018;
originally announced April 2018.
-
VBALD - Variational Bayesian Approximation of Log Determinants
Authors:
Diego Granziol,
Edward Wagstaff,
Bin Xin Ru,
Michael Osborne,
Stephen Roberts
Abstract:
Evaluating the log determinant of a positive definite matrix is ubiquitous in machine learning. Applications thereof range from Gaussian processes, minimum-volume ellipsoids, metric learning, kernel learning, Bayesian neural networks, Determinental Point Processes, Markov random fields to partition functions of discrete graphical models. In order to avoid the canonical, yet prohibitive, Cholesky…
▽ More
Evaluating the log determinant of a positive definite matrix is ubiquitous in machine learning. Applications thereof range from Gaussian processes, minimum-volume ellipsoids, metric learning, kernel learning, Bayesian neural networks, Determinental Point Processes, Markov random fields to partition functions of discrete graphical models. In order to avoid the canonical, yet prohibitive, Cholesky $\mathcal{O}(n^{3})$ computational cost, we propose a novel approach, with complexity $\mathcal{O}(n^{2})$, based on a constrained variational Bayes algorithm. We compare our method to Taylor, Chebyshev and Lanczos approaches and show state of the art performance on both synthetic and real-world datasets.
△ Less
Submitted 21 February, 2018;
originally announced February 2018.
-
Fast Information-theoretic Bayesian Optimisation
Authors:
Binxin Ru,
Mark McLeod,
Diego Granziol,
Michael A. Osborne
Abstract:
Information-theoretic Bayesian optimisation techniques have demonstrated state-of-the-art performance in tackling important global optimisation problems. However, current information-theoretic approaches require many approximations in implementation, introduce often-prohibitive computational overhead and limit the choice of kernels available to model the objective. We develop a fast information-th…
▽ More
Information-theoretic Bayesian optimisation techniques have demonstrated state-of-the-art performance in tackling important global optimisation problems. However, current information-theoretic approaches require many approximations in implementation, introduce often-prohibitive computational overhead and limit the choice of kernels available to model the objective. We develop a fast information-theoretic Bayesian Optimisation method, FITBO, that avoids the need for sampling the global minimiser, thus significantly reducing computational overhead. Moreover, in comparison with existing approaches, our method faces fewer constraints on kernel choice and enjoys the merits of dealing with the output space. We demonstrate empirically that FITBO inherits the performance associated with information-theoretic Bayesian optimisation, while being even faster than simpler Bayesian optimisation approaches, such as Expected Improvement.
△ Less
Submitted 6 June, 2018; v1 submitted 2 November, 2017;
originally announced November 2017.
-
Entropic Determinants
Authors:
Diego Granziol,
Stephen Roberts
Abstract:
The ability of many powerful machine learning algorithms to deal with large data sets without compromise is often hampered by computationally expensive linear algebra tasks, of which calculating the log determinant is a canonical example. In this paper we demonstrate the optimality of Maximum Entropy methods in approximating such calculations. We prove the equivalence between mean value constraint…
▽ More
The ability of many powerful machine learning algorithms to deal with large data sets without compromise is often hampered by computationally expensive linear algebra tasks, of which calculating the log determinant is a canonical example. In this paper we demonstrate the optimality of Maximum Entropy methods in approximating such calculations. We prove the equivalence between mean value constraints and sample expectations in the big data limit, that Covariance matrix eigenvalue distributions can be completely defined by moment information and that the reduction of the self entropy of a maximum entropy proposal distribution, achieved by adding more moments reduces the KL divergence between the proposal and true eigenvalue distribution. We empirically verify our results on a variety of SparseSuite matrices and establish best practices.
△ Less
Submitted 8 September, 2017;
originally announced September 2017.
-
Entropic Trace Estimates for Log Determinants
Authors:
Jack Fitzsimons,
Diego Granziol,
Kurt Cutajar,
Michael Osborne,
Maurizio Filippone,
Stephen Roberts
Abstract:
The scalable calculation of matrix determinants has been a bottleneck to the widespread application of many machine learning methods such as determinantal point processes, Gaussian processes, generalised Markov random fields, graph models and many others. In this work, we estimate log determinants under the framework of maximum entropy, given information in the form of moment constraints from stoc…
▽ More
The scalable calculation of matrix determinants has been a bottleneck to the widespread application of many machine learning methods such as determinantal point processes, Gaussian processes, generalised Markov random fields, graph models and many others. In this work, we estimate log determinants under the framework of maximum entropy, given information in the form of moment constraints from stochastic trace estimation. The estimates demonstrate a significant improvement on state-of-the-art alternative methods, as shown on a wide variety of UFL sparse matrices. By taking the example of a general Markov random field, we also demonstrate how this approach can significantly accelerate inference in large-scale learning methods involving the log determinant.
△ Less
Submitted 24 April, 2017;
originally announced April 2017.
-
An information and field theoretic approach to the grand canonical ensemble
Authors:
Diego Granziol,
Stephen Roberts
Abstract:
We present a novel derivation of the constraints required to obtain the underlying principles of statistical mechanics using a maximum entropy framework. We derive the mean value constraints by use of the central limit theorem and the scaling properties of Lagrange multipliers. We then arrive at the same result using a quantum free field theory and the Ward identities. The work provides a principl…
▽ More
We present a novel derivation of the constraints required to obtain the underlying principles of statistical mechanics using a maximum entropy framework. We derive the mean value constraints by use of the central limit theorem and the scaling properties of Lagrange multipliers. We then arrive at the same result using a quantum free field theory and the Ward identities. The work provides a principled footing for maximum entropy methods in statistical physics, adding the body of work aligned to Jaynes's vision of statistical mechanics as a form of inference rather than a physical theory dependent on ergodicity, metric transitivity and equal a priori probabilities. We show that statistical independence, in the macroscopic limit, is the unifying concept that leads to all these derivations.
△ Less
Submitted 29 March, 2017;
originally announced March 2017.