Search | arXiv e-print repository

A PAC-Bayesian Perspective on the Interpolating Information Criterion

Authors: Liam Hodgkinson, Chris van der Heide, Robert Salomone, Fred Roosta, Michael W. Mahoney

Abstract: Deep learning is renowned for its theory-practice gap, whereby principled theory typically fails to provide much beneficial guidance for implementation in practice. This has been highlighted recently by the benign overfitting phenomenon: when neural networks become sufficiently large to interpolate the dataset perfectly, model performance appears to improve with increasing model size, in apparent… ▽ More Deep learning is renowned for its theory-practice gap, whereby principled theory typically fails to provide much beneficial guidance for implementation in practice. This has been highlighted recently by the benign overfitting phenomenon: when neural networks become sufficiently large to interpolate the dataset perfectly, model performance appears to improve with increasing model size, in apparent contradiction with the well-known bias-variance tradeoff. While such phenomena have proven challenging to theoretically study for general models, the recently proposed Interpolating Information Criterion (IIC) provides a valuable theoretical framework to examine performance for overparameterized models. Using the IIC, a PAC-Bayes bound is obtained for a general class of models, characterizing factors which influence generalization performance in the interpolating regime. From the provided bound, we quantify how the test error for overparameterized models achieving effectively zero training error depends on the quality of the implicit regularization imposed by e.g. the combination of model, optimizer, and parameter-initialization scheme; the spectrum of the empirical neural tangent kernel; curvature of the loss landscape; and noise present in the data. △ Less

Submitted 12 November, 2023; originally announced November 2023.

Comments: 9 pages

arXiv:2307.07785 [pdf, other]

The Interpolating Information Criterion for Overparameterized Models

Authors: Liam Hodgkinson, Chris van der Heide, Robert Salomone, Fred Roosta, Michael W. Mahoney

Abstract: The problem of model selection is considered for the setting of interpolating estimators, where the number of model parameters exceeds the size of the dataset. Classical information criteria typically consider the large-data limit, penalizing model size. However, these criteria are not appropriate in modern settings where overparameterized models tend to perform well. For any overparameterized mod… ▽ More The problem of model selection is considered for the setting of interpolating estimators, where the number of model parameters exceeds the size of the dataset. Classical information criteria typically consider the large-data limit, penalizing model size. However, these criteria are not appropriate in modern settings where overparameterized models tend to perform well. For any overparameterized model, we show that there exists a dual underparameterized model that possesses the same marginal likelihood, thus establishing a form of Bayesian duality. This enables more classical methods to be used in the overparameterized setting, revealing the Interpolating Information Criterion, a measure of model quality that naturally incorporates the choice of prior into the model selection. Our new information criterion accounts for prior misspecification, geometric and spectral properties of the model, and is numerically consistent with known empirical and theoretical behavior in this regime. △ Less

Submitted 15 July, 2023; originally announced July 2023.

Comments: 23 pages, 2 figures

arXiv:2307.02501 [pdf, ps, other]

Generalization Guarantees via Algorithm-dependent Rademacher Complexity

Authors: Sarah Sachs, Tim van Erven, Liam Hodgkinson, Rajiv Khanna, Umut Simsekli

Abstract: Algorithm- and data-dependent generalization bounds are required to explain the generalization behavior of modern machine learning algorithms. In this context, there exists information theoretic generalization bounds that involve (various forms of) mutual information, as well as bounds based on hypothesis set stability. We propose a conceptually related, but technically distinct complexity measure… ▽ More Algorithm- and data-dependent generalization bounds are required to explain the generalization behavior of modern machine learning algorithms. In this context, there exists information theoretic generalization bounds that involve (various forms of) mutual information, as well as bounds based on hypothesis set stability. We propose a conceptually related, but technically distinct complexity measure to control generalization error, which is the empirical Rademacher complexity of an algorithm- and data-dependent hypothesis class. Combining standard properties of Rademacher complexity with the convenient structure of this class, we are able to (i) obtain novel bounds based on the finite fractal dimension, which (a) extend previous fractal dimension-type bounds from continuous to finite hypothesis classes, and (b) avoid a mutual information term that was required in prior work; (ii) we greatly simplify the proof of a recent dimension-independent generalization bound for stochastic gradient descent; and (iii) we easily recover results for VC classes and compression schemes, similar to approaches based on conditional mutual information. △ Less

Submitted 4 July, 2023; originally announced July 2023.

arXiv:2306.09262 [pdf, other]

A Heavy-Tailed Algebra for Probabilistic Programming

Authors: Feynman Liang, Liam Hodgkinson, Michael W. Mahoney

Abstract: Despite the successes of probabilistic models based on passing noise through neural networks, recent work has identified that such methods often fail to capture tail behavior accurately, unless the tails of the base distribution are appropriately calibrated. To overcome this deficiency, we propose a systematic approach for analyzing the tails of random variables, and we illustrate how this approac… ▽ More Despite the successes of probabilistic models based on passing noise through neural networks, recent work has identified that such methods often fail to capture tail behavior accurately, unless the tails of the base distribution are appropriately calibrated. To overcome this deficiency, we propose a systematic approach for analyzing the tails of random variables, and we illustrate how this approach can be used during the static analysis (before drawing samples) pass of a probabilistic programming language compiler. To characterize how the tails change under various operations, we develop an algebra which acts on a three-parameter family of tail asymptotics and which is based on the generalized Gamma distribution. Our algebraic operations are closed under addition and multiplication; they are capable of distinguishing sub-Gaussians with differing scales; and they handle ratios sufficiently well to reproduce the tails of most important statistical distributions directly from their definitions. Our empirical results confirm that inference algorithms that leverage our heavy-tailed algebra attain superior performance across a number of density modeling and variational inference tasks. △ Less

Submitted 15 June, 2023; originally announced June 2023.

Comments: 21 pages, 6 figures

arXiv:2305.12313 [pdf, other]

When are ensembles really effective?

Authors: Ryan Theisen, Hyunsuk Kim, Yaoqing Yang, Liam Hodgkinson, Michael W. Mahoney

Abstract: Ensembling has a long history in statistical data analysis, with many impactful applications. However, in many modern machine learning settings, the benefits of ensembling are less ubiquitous and less obvious. We study, both theoretically and empirically, the fundamental question of when ensembling yields significant performance improvements in classification tasks. Theoretically, we prove new res… ▽ More Ensembling has a long history in statistical data analysis, with many impactful applications. However, in many modern machine learning settings, the benefits of ensembling are less ubiquitous and less obvious. We study, both theoretically and empirically, the fundamental question of when ensembling yields significant performance improvements in classification tasks. Theoretically, we prove new results relating the \emph{ensemble improvement rate} (a measure of how much ensembling decreases the error rate versus a single model, on a relative scale) to the \emph{disagreement-error ratio}. We show that ensembling improves performance significantly whenever the disagreement rate is large relative to the average error rate; and that, conversely, one classifier is often enough whenever the disagreement rate is low relative to the average error rate. On the way to proving these results, we derive, under a mild condition called \emph{competence}, improved upper and lower bounds on the average test error rate of the majority vote classifier. To complement this theory, we study ensembling empirically in a variety of settings, verifying the predictions made by our theory, and identifying practical scenarios where ensembling does and does not result in large performance improvements. Perhaps most notably, we demonstrate a distinct difference in behavior between interpolating models (popular in current practice) and non-interpolating models (such as tree-based methods, where ensembling is popular), demonstrating that ensembling helps considerably more in the latter case than in the former. △ Less

Submitted 20 May, 2023; originally announced May 2023.

arXiv:2210.07612 [pdf, other]

Monotonicity and Double Descent in Uncertainty Estimation with Gaussian Processes

Authors: Liam Hodgkinson, Chris van der Heide, Fred Roosta, Michael W. Mahoney

Abstract: Despite their importance for assessing reliability of predictions, uncertainty quantification (UQ) measures for machine learning models have only recently begun to be rigorously characterized. One prominent issue is the curse of dimensionality: it is commonly believed that the marginal likelihood should be reminiscent of cross-validation metrics and that both should deteriorate with larger input d… ▽ More Despite their importance for assessing reliability of predictions, uncertainty quantification (UQ) measures for machine learning models have only recently begun to be rigorously characterized. One prominent issue is the curse of dimensionality: it is commonly believed that the marginal likelihood should be reminiscent of cross-validation metrics and that both should deteriorate with larger input dimensions. We prove that by tuning hyperparameters to maximize marginal likelihood (the empirical Bayes procedure), the performance, as measured by the marginal likelihood, improves monotonically} with the input dimension. On the other hand, we prove that cross-validation metrics exhibit qualitatively different behavior that is characteristic of double descent. Cold posteriors, which have recently attracted interest due to their improved performance in certain settings, appear to exacerbate these phenomena. We verify empirically that our results hold for real data, beyond our considered assumptions, and we explore consequences involving synthetic covariates. △ Less

Submitted 25 July, 2023; v1 submitted 14 October, 2022; originally announced October 2022.

Comments: 33 pages, 21 figures

arXiv:2205.07918 [pdf, other]

Fat-Tailed Variational Inference with Anisotropic Tail Adaptive Flows

Authors: Feynman Liang, Liam Hodgkinson, Michael W. Mahoney

Abstract: While fat-tailed densities commonly arise as posterior and marginal distributions in robust models and scale mixtures, they present challenges when Gaussian-based variational inference fails to capture tail decay accurately. We first improve previous theory on tails of Lipschitz flows by quantifying how the tails affect the rate of tail decay and by expanding the theory to non-Lipschitz polynomial… ▽ More While fat-tailed densities commonly arise as posterior and marginal distributions in robust models and scale mixtures, they present challenges when Gaussian-based variational inference fails to capture tail decay accurately. We first improve previous theory on tails of Lipschitz flows by quantifying how the tails affect the rate of tail decay and by expanding the theory to non-Lipschitz polynomial flows. Then, we develop an alternative theory for multivariate tail parameters which is sensitive to tail-anisotropy. In doing so, we unveil a fundamental problem which plagues many existing flow-based methods: they can only model tail-isotropic distributions (i.e., distributions having the same tail parameter in every direction). To mitigate this and enable modeling of tail-anisotropic targets, we propose anisotropic tail-adaptive flows (ATAF). Experimental results on both synthetic and real-world targets confirm that ATAF is competitive with prior work while also exhibiting appropriate tail-anisotropy. △ Less

Submitted 16 May, 2022; originally announced May 2022.

arXiv:2108.00781 [pdf, other]

Generalization Bounds using Lower Tail Exponents in Stochastic Optimizers

Authors: Liam Hodgkinson, Umut Şimşekli, Rajiv Khanna, Michael W. Mahoney

Abstract: Despite the ubiquitous use of stochastic optimization algorithms in machine learning, the precise impact of these algorithms and their dynamics on generalization performance in realistic non-convex settings is still poorly understood. While recent work has revealed connections between generalization and heavy-tailed behavior in stochastic optimization, this work mainly relied on continuous-time ap… ▽ More Despite the ubiquitous use of stochastic optimization algorithms in machine learning, the precise impact of these algorithms and their dynamics on generalization performance in realistic non-convex settings is still poorly understood. While recent work has revealed connections between generalization and heavy-tailed behavior in stochastic optimization, this work mainly relied on continuous-time approximations; and a rigorous treatment for the original discrete-time iterations is yet to be performed. To bridge this gap, we present novel bounds linking generalization to the lower tail exponent of the transition kernel associated with the optimizer around a local minimum, in both discrete- and continuous-time settings. To achieve this, we first prove a data- and algorithm-dependent generalization bound in terms of the celebrated Fernique-Talagrand functional applied to the trajectory of the optimizer. Then, we specialize this result by exploiting the Markovian structure of stochastic optimizers, and derive bounds in terms of their (data-dependent) transition kernels. We support our theory with empirical results from a variety of neural networks, showing correlations between generalization error and lower tail exponents. △ Less

Submitted 11 July, 2022; v1 submitted 2 August, 2021; originally announced August 2021.

Comments: 22 pages, 6 figures

arXiv:2106.10820 [pdf, other]

Stateful ODE-Nets using Basis Function Expansions

Authors: Alejandro Queiruga, N. Benjamin Erichson, Liam Hodgkinson, Michael W. Mahoney

Abstract: The recently-introduced class of ordinary differential equation networks (ODE-Nets) establishes a fruitful connection between deep learning and dynamical systems. In this work, we reconsider formulations of the weights as continuous-in-depth functions using linear combinations of basis functions which enables us to leverage parameter transformations such as function projections. In turn, this view… ▽ More The recently-introduced class of ordinary differential equation networks (ODE-Nets) establishes a fruitful connection between deep learning and dynamical systems. In this work, we reconsider formulations of the weights as continuous-in-depth functions using linear combinations of basis functions which enables us to leverage parameter transformations such as function projections. In turn, this view allows us to formulate a novel stateful ODE-Block that handles stateful layers. The benefits of this new ODE-Block are twofold: first, it enables incorporating meaningful continuous-in-depth batch normalization layers to achieve state-of-the-art performance; second, it enables compressing the weights through a change of basis, without retraining, while maintaining near state-of-the-art performance and reducing both inference time and memory footprint. Performance is demonstrated by applying our stateful ODE-Block to (a) image classification tasks using convolutional units and (b) sentence-tagging tasks using transformer encoder units. △ Less

Submitted 6 November, 2021; v1 submitted 20 June, 2021; originally announced June 2021.

Comments: Accepted at 35th Conference on Neural Information Processing Systems (NeurIPS 2021)

arXiv:2102.04877 [pdf, other]

Noisy Recurrent Neural Networks

Authors: Soon Hoe Lim, N. Benjamin Erichson, Liam Hodgkinson, Michael W. Mahoney

Abstract: We provide a general framework for studying recurrent neural networks (RNNs) trained by injecting noise into hidden states. Specifically, we consider RNNs that can be viewed as discretizations of stochastic differential equations driven by input data. This framework allows us to study the implicit regularization effect of general noise injection schemes by deriving an approximate explicit regulari… ▽ More We provide a general framework for studying recurrent neural networks (RNNs) trained by injecting noise into hidden states. Specifically, we consider RNNs that can be viewed as discretizations of stochastic differential equations driven by input data. This framework allows us to study the implicit regularization effect of general noise injection schemes by deriving an approximate explicit regularizer in the small noise regime. We find that, under reasonable assumptions, this implicit regularization promotes flatter minima; it biases towards models with more stable dynamics; and, in classification tasks, it favors models with larger classification margin. Sufficient conditions for global stability are obtained, highlighting the phenomenon of stochastic stabilization, where noise injection can improve stability during training. Our theory is supported by empirical results which demonstrate that the RNNs have improved robustness with respect to various input perturbations. △ Less

Submitted 1 December, 2021; v1 submitted 9 February, 2021; originally announced February 2021.

Comments: 38 pages

Journal ref: NeurIPS 2021 (https://proceedings.neurips.cc/paper/2021/hash/29301521774ff3cbd26652b2d5c95996-Abstract.html)

arXiv:2006.12070 [pdf, other]

Lipschitz Recurrent Neural Networks

Authors: N. Benjamin Erichson, Omri Azencot, Alejandro Queiruga, Liam Hodgkinson, Michael W. Mahoney

Abstract: Viewing recurrent neural networks (RNNs) as continuous-time dynamical systems, we propose a recurrent unit that describes the hidden state's evolution with two parts: a well-understood linear component plus a Lipschitz nonlinearity. This particular functional form facilitates stability analysis of the long-term behavior of the recurrent unit using tools from nonlinear systems theory. In turn, this… ▽ More Viewing recurrent neural networks (RNNs) as continuous-time dynamical systems, we propose a recurrent unit that describes the hidden state's evolution with two parts: a well-understood linear component plus a Lipschitz nonlinearity. This particular functional form facilitates stability analysis of the long-term behavior of the recurrent unit using tools from nonlinear systems theory. In turn, this enables architectural design decisions before experimentation. Sufficient conditions for global stability of the recurrent unit are obtained, motivating a novel scheme for constructing hidden-to-hidden matrices. Our experiments demonstrate that the Lipschitz RNN can outperform existing recurrent units on a range of benchmark tasks, including computer vision, language modeling and speech prediction tasks. Finally, through Hessian-based analysis we demonstrate that our Lipschitz recurrent unit is more robust with respect to input and parameter perturbations as compared to other continuous-time RNNs. △ Less

Submitted 23 April, 2021; v1 submitted 22 June, 2020; originally announced June 2020.

Comments: Published as a conference paper at ICLR 2021

arXiv:2006.06293 [pdf, other]

Multiplicative noise and heavy tails in stochastic optimization

Authors: Liam Hodgkinson, Michael W. Mahoney

Abstract: Although stochastic optimization is central to modern machine learning, the precise mechanisms underlying its success, and in particular, the precise role of the stochasticity, still remain unclear. Modelling stochastic optimization algorithms as discrete random recurrence relations, we show that multiplicative noise, as it commonly arises due to variance in local rates of convergence, results in… ▽ More Although stochastic optimization is central to modern machine learning, the precise mechanisms underlying its success, and in particular, the precise role of the stochasticity, still remain unclear. Modelling stochastic optimization algorithms as discrete random recurrence relations, we show that multiplicative noise, as it commonly arises due to variance in local rates of convergence, results in heavy-tailed stationary behaviour in the parameters. A detailed analysis is conducted for SGD applied to a simple linear regression problem, followed by theoretical results for a much larger class of models (including non-linear and non-convex) and optimizers (including momentum, Adam, and stochastic Newton), demonstrating that our qualitative results hold much more generally. In each case, we describe dependence on key factors, including step size, batch size, and data variability, all of which exhibit similar qualitative behavior to recent empirical results on state-of-the-art neural network models from computer vision and natural language processing. Furthermore, we empirically demonstrate how multiplicative noise and heavy-tailed structure improve capacity for basin hop** and exploration of non-convex loss surfaces, over commonly-considered stochastic dynamics with only additive noise and light-tailed structure. △ Less

Submitted 11 June, 2020; originally announced June 2020.

Comments: 30 pages, 7 figures

arXiv:2002.09547 [pdf, other]

Stochastic Normalizing Flows

Authors: Liam Hodgkinson, Chris van der Heide, Fred Roosta, Michael W. Mahoney

Abstract: We introduce stochastic normalizing flows, an extension of continuous normalizing flows for maximum likelihood estimation and variational inference (VI) using stochastic differential equations (SDEs). Using the theory of rough paths, the underlying Brownian motion is treated as a latent variable and approximated, enabling efficient training of neural SDEs as random neural ordinary differential equ… ▽ More We introduce stochastic normalizing flows, an extension of continuous normalizing flows for maximum likelihood estimation and variational inference (VI) using stochastic differential equations (SDEs). Using the theory of rough paths, the underlying Brownian motion is treated as a latent variable and approximated, enabling efficient training of neural SDEs as random neural ordinary differential equations. These SDEs can be used for constructing efficient Markov chains to sample from the underlying distribution of a given dataset. Furthermore, by considering families of targeted SDEs with prescribed stationary distribution, we can apply VI to the optimization of hyperparameters in stochastic MCMC. △ Less

Submitted 25 February, 2020; v1 submitted 21 February, 2020; originally announced February 2020.

Comments: 17 pages, 4 figures

arXiv:2001.09266 [pdf, other]

The reproducing Stein kernel approach for post-hoc corrected sampling

Authors: Liam Hodgkinson, Robert Salomone, Fred Roosta

Abstract: Stein importance sampling is a widely applicable technique based on kernelized Stein discrepancy, which corrects the output of approximate sampling algorithms by reweighting the empirical distribution of the samples. A general analysis of this technique is conducted for the previously unconsidered setting where samples are obtained via the simulation of a Markov chain, and applies to an arbitrary… ▽ More Stein importance sampling is a widely applicable technique based on kernelized Stein discrepancy, which corrects the output of approximate sampling algorithms by reweighting the empirical distribution of the samples. A general analysis of this technique is conducted for the previously unconsidered setting where samples are obtained via the simulation of a Markov chain, and applies to an arbitrary underlying Polish space. We prove that Stein importance sampling yields consistent estimators for quantities related to a target distribution of interest by using samples obtained from a geometrically ergodic Markov chain with a possibly unknown invariant measure that differs from the desired target. The approach is shown to be valid under conditions that are satisfied for a large number of unadjusted samplers, and is capable of retaining consistency when data subsampling is used. Along the way, a universal theory of reproducing Stein kernels is established, which enables the construction of kernelized Stein discrepancy on general Polish spaces, and provides sufficient conditions for kernels to be convergence-determining on such spaces. These results are of independent interest for the development of future methodology based on kernelized Stein discrepancies. △ Less

Submitted 13 September, 2021; v1 submitted 25 January, 2020; originally announced January 2020.

Comments: 26 pages, 2 figures

MSC Class: 65C05 (Primary) 60J22; 60B10 (Secondary)

arXiv:1907.08410 [pdf, other]

Geometric Rates of Convergence for Kernel-based Sampling Algorithms

Authors: Rajiv Khanna, Liam Hodgkinson, Michael W. Mahoney

Abstract: The rate of convergence of weighted kernel herding (WKH) and sequential Bayesian quadrature (SBQ), two kernel-based sampling algorithms for estimating integrals with respect to some target probability measure, is investigated. Under verifiable conditions on the chosen kernel and target measure, we establish a near-geometric rate of convergence for target measures that are nearly atomic. Furthermor… ▽ More The rate of convergence of weighted kernel herding (WKH) and sequential Bayesian quadrature (SBQ), two kernel-based sampling algorithms for estimating integrals with respect to some target probability measure, is investigated. Under verifiable conditions on the chosen kernel and target measure, we establish a near-geometric rate of convergence for target measures that are nearly atomic. Furthermore, we show these algorithms perform comparably to the theoretical best possible sampling algorithm under the maximum mean discrepancy. An analysis is also conducted in a distributed setting. Our theoretical developments are supported by empirical observations on simulated data as well as a real world application. △ Less

Submitted 31 October, 2021; v1 submitted 19 July, 2019; originally announced July 2019.

Comments: Accepted to UAI 2021 (Oral)

arXiv:1903.12322 [pdf, other]

Implicit Langevin Algorithms for Sampling From Log-concave Densities

Authors: Liam Hodgkinson, Robert Salomone, Fred Roosta

Abstract: For sampling from a log-concave density, we study implicit integrators resulting from $θ$-method discretization of the overdamped Langevin diffusion stochastic differential equation. Theoretical and algorithmic properties of the resulting sampling methods for $ θ\in [0,1] $ and a range of step sizes are established. Our results generalize and extend prior works in several directions. In particular… ▽ More For sampling from a log-concave density, we study implicit integrators resulting from $θ$-method discretization of the overdamped Langevin diffusion stochastic differential equation. Theoretical and algorithmic properties of the resulting sampling methods for $ θ\in [0,1] $ and a range of step sizes are established. Our results generalize and extend prior works in several directions. In particular, for $θ\ge1/2$, we prove geometric ergodicity and stability of the resulting methods for all step sizes. We show that obtaining subsequent samples amounts to solving a strongly-convex optimization problem, which is readily achievable using one of numerous existing methods. Numerical examples supporting our theoretical analysis are also presented. △ Less

Submitted 10 July, 2021; v1 submitted 28 March, 2019; originally announced March 2019.

Showing 1–16 of 16 results for author: Hodgkinson, L