-
Fantastic Generalization Measures are Nowhere to be Found
Authors:
Michael Gastpar,
Ido Nachum,
Jonathan Shafer,
Thomas Weinberger
Abstract:
We study the notion of a generalization bound being uniformly tight, meaning that the difference between the bound and the population loss is small for all learning algorithms and all population distributions. Numerous generalization bounds have been proposed in the literature as potential explanations for the ability of neural networks to generalize in the overparameterized setting. However, in their paper ``Fantastic Generalization Measures and Where to Find Them,'' Jiang et al. (2020) examine more than a dozen generalization bounds, and show empirically that none of them are uniformly tight. This raises the question of whether uniformly-tight generalization bounds are at all possible in the overparameterized setting. We consider two types of generalization bounds: (1) bounds that may depend on the training set and the learned hypothesis (e.g., margin bounds). We prove mathematically that no such bound can be uniformly tight in the overparameterized setting; (2) bounds that may in addition also depend on the learning algorithm (e.g., stability bounds). For these bounds, we show a trade-off between the algorithm's performance and the bound's tightness. Namely, if the algorithm achieves good accuracy on certain distributions, then no generalization bound can be uniformly tight for it in the overparameterized setting. We explain how these formal results can, in our view, inform research on generalization bounds for neural networks, while stressing that other interpretations of these results are also possible.
Submitted 28 November, 2023; v1 submitted 24 September, 2023;
originally announced September 2023.
-
Finite Littlestone Dimension Implies Finite Information Complexity
Authors:
Aditya Pradeep,
Ido Nachum,
Michael Gastpar
Abstract:
We prove that every online learnable class of functions of Littlestone dimension $d$ admits a learning algorithm with finite information complexity. Towards this end, we use the notion of a globally stable algorithm. Generally, the information complexity of such a globally stable algorithm is large yet finite, roughly exponential in $d$. We also show there is room for improvement; for a canonical online learnable class, indicator functions of affine subspaces of dimension $d$, the information complexity can be upper bounded logarithmically in $d$.
Submitted 27 June, 2022;
originally announced June 2022.
-
A Johnson--Lindenstrauss Framework for Randomly Initialized CNNs
Authors:
Ido Nachum,
Jan Hązła,
Michael Gastpar,
Anatoly Khina
Abstract:
How does the geometric representation of a dataset change after the application of each randomly initialized layer of a neural network? The celebrated Johnson--Lindenstrauss lemma answers this question for linear fully-connected neural networks (FNNs), stating that the geometry is essentially preserved. For FNNs with the ReLU activation, the angle between two inputs contracts according to a known mapping. The question for non-linear convolutional neural networks (CNNs) becomes much more intricate. To answer this question, we introduce a geometric framework. For linear CNNs, we show that the Johnson--Lindenstrauss lemma continues to hold, namely, that the angle between two inputs is preserved. For CNNs with ReLU activation, on the other hand, the behavior is richer: The angle between the outputs contracts, where the level of contraction depends on the nature of the inputs. In particular, after one layer, the geometry of natural images is essentially preserved, whereas for Gaussian correlated inputs, CNNs exhibit the same contracting behavior as FNNs with ReLU activation.
Submitted 7 March, 2022; v1 submitted 3 November, 2021;
originally announced November 2021.
-
Regularization by Misclassification in ReLU Neural Networks
Authors:
Elisabetta Cornacchia,
Jan Hązła,
Ido Nachum,
Amir Yehudayoff
Abstract:
We study the implicit bias of ReLU neural networks trained by a variant of SGD where at each step, the label is changed with probability $p$ to a random label (label smoothing being a close variant of this procedure). Our experiments demonstrate that label noise propels the network to a sparse solution in the following sense: for a typical input, a small fraction of neurons are active, and the firing pattern of the hidden layers is sparser. In fact, for some instances, an appropriate amount of label noise does not only sparsify the network but further reduces the test error. We then turn to the theoretical analysis of such sparsification mechanisms, focusing on the extremal case of $p=1$. We show that in this case, the network withers as anticipated from experiments, but surprisingly, in different ways that depend on the learning rate and the presence of bias, with either weights vanishing or neurons ceasing to fire.
Submitted 3 November, 2021;
originally announced November 2021.
-
Almost-Reed--Muller Codes Achieve Constant Rates for Random Errors
Authors:
Emmanuel Abbe,
Jan Hązła,
Ido Nachum
Abstract:
This paper considers '$δ$-almost Reed-Muller codes', i.e., linear codes spanned by evaluations of all but a $δ$ fraction of monomials of degree at most $d$. It is shown that for any $δ> 0$ and any $\varepsilon>0$, there exists a family of $δ$-almost Reed-Muller codes of constant rate that correct $1/2-\varepsilon$ fraction of random errors with high probability. For exact Reed-Muller codes, the analogous result is not known and represents a weaker version of the longstanding conjecture that Reed-Muller codes achieve capacity for random errors (Abbe-Shpilka-Wigderson STOC '15). Our approach is based on the recent polarization result for Reed-Muller codes, combined with a combinatorial approach to establishing inequalities between the Reed-Muller code entropies.
Submitted 5 October, 2021; v1 submitted 20 April, 2020;
originally announced April 2020.
-
On Symmetry and Initialization for Neural Networks
Authors:
Ido Nachum,
Amir Yehudayoff
Abstract:
This work provides an additional step in the theoretical understanding of neural networks. We consider neural networks with one hidden layer and show that when learning symmetric functions, one can choose initial conditions so that standard SGD training efficiently produces generalization guarantees. We empirically verify this and show that this does not hold when the initial conditions are chosen at random. The proof of convergence investigates the interaction between the two layers of the network. Our results highlight the importance of using symmetry in the design of neural networks.
Submitted 1 July, 2019;
originally announced July 2019.
-
Average-Case Information Complexity of Learning
Authors:
Ido Nachum,
Amir Yehudayoff
Abstract:
How many bits of information are revealed by a learning algorithm for a concept class of VC-dimension $d$? Previous works have shown that even for $d=1$ the amount of information may be unbounded (tend to $\infty$ with the universe size). Can it be that all concepts in the class require leaking a large amount of information? We show that typically concepts do not require leakage. There exists a proper learning algorithm that reveals $O(d)$ bits of information for most concepts in the class. This result is a special case of a more general phenomenon we explore. If there is a low information learner when the algorithm {\em knows} the underlying distribution on inputs, then there is a learner that reveals little information on an average concept {\em without knowing} the distribution on inputs.
Submitted 24 November, 2018;
originally announced November 2018.
-
Direct Automated Quantitative Measurement of Spine via Cascade Amplifier Regression Network
Authors:
Shumao Pang,
Stephanie Leung,
Ilanit Ben Nachum,
Qianjin Feng,
Shuo Li
Abstract:
Automated quantitative measurement of the spine (i.e., estimation of multiple indices such as heights, widths, and areas of the vertebral bodies and discs) is of the utmost importance in the clinical diagnosis of spinal diseases such as osteoporosis, intervertebral disc degeneration, and lumbar disc herniation, yet it remains an unprecedented challenge due to the variety of spine structures and the high dimensionality of the indices to be estimated. In this paper, we propose a novel cascade amplifier regression network (CARN), which includes the CARN architecture and a local shape-constrained manifold regularization (LSCMR) loss function, to achieve accurate direct automated estimation of multiple indices. The CARN architecture is composed of a cascade amplifier network (CAN) for expressive feature embedding and a linear regression model for multiple indices estimation. The CAN consists of cascaded amplifier units (AUs), which are used for selective feature reuse by stimulating effective features and suppressing redundant features while propagating feature maps between adjacent layers, so that an expressive feature embedding is obtained. During training, the LSCMR is used to alleviate overfitting and to generate realistic estimates by learning the distribution of the multiple indices. Experiments on MR images of 195 subjects show that the proposed CARN achieves impressive performance, with mean absolute errors of 1.2496 mm, 1.2887 mm, and 1.2692 mm for estimation of 15 heights of discs, 15 heights of vertebral bodies, and total indices, respectively. The proposed method has great potential in clinical spinal disease diagnoses.
Submitted 14 June, 2018;
originally announced June 2018.
-
On the Perceptron's Compression
Authors:
Shay Moran,
Ido Nachum,
Itai Panasoff,
Amir Yehudayoff
Abstract:
We study and provide exposition to several phenomena that are related to the perceptron's compression. One theme concerns modifications of the perceptron algorithm that yield better guarantees on the margin of the hyperplane it outputs. These modifications can be useful in training neural networks as well, and we demonstrate them with some experimental data. In a second theme, we deduce conclusions from the perceptron's compression in various contexts.
Submitted 14 June, 2018;
originally announced June 2018.
-
A Direct Sum Result for the Information Complexity of Learning
Authors:
Ido Nachum,
Jonathan Shafer,
Amir Yehudayoff
Abstract:
How many bits of information are required to PAC learn a class of hypotheses of VC dimension $d$? The mathematical setting we follow is that of Bassily et al. (2018), where the value of interest is the mutual information $\mathrm{I}(S;A(S))$ between the input sample $S$ and the hypothesis outputted by the learning algorithm $A$. We introduce a class of functions of VC dimension $d$ over the domain $\mathcal{X}$ with information complexity at least $Ω\left(d\log \log \frac{|\mathcal{X}|}{d}\right)$ bits for any consistent and proper algorithm (deterministic or random). Bassily et al. proved a similar (but quantitatively weaker) result for the case $d=1$.
The above result is in fact a special case of a more general phenomenon we explore. We define the notion of information complexity of a given class of functions $\mathcal{H}$. Intuitively, it is the minimum amount of information that an algorithm for $\mathcal{H}$ must retain about its input to ensure consistency and properness. We prove a direct sum result for information complexity in this context; roughly speaking, the information complexity sums when combining several classes.
Submitted 15 April, 2018;
originally announced April 2018.
-
Learners that Use Little Information
Authors:
Raef Bassily,
Shay Moran,
Ido Nachum,
Jonathan Shafer,
Amir Yehudayoff
Abstract:
We study learning algorithms that are restricted to using a small amount of information from their input sample. We introduce a category of learning algorithms we term $d$-bit information learners, which are algorithms whose output conveys at most $d$ bits of information of their input. A central theme in this work is that such algorithms generalize.
We focus on the learning capacity of these algorithms, and prove sample complexity bounds with tight dependencies on the confidence and error parameters. We also observe connections with well studied notions such as sample compression schemes, Occam's razor, PAC-Bayes and differential privacy.
We discuss an approach that allows us to prove upper bounds on the amount of information that algorithms reveal about their inputs, and also provide a lower bound by showing a simple concept class for which every (possibly randomized) empirical risk minimizer must reveal a lot of information. On the other hand, we show that in the distribution-dependent setting every VC class has empirical risk minimizers that do not reveal a lot of information.
Submitted 27 February, 2018; v1 submitted 14 October, 2017;
originally announced October 2017.
-
Direct Estimation of Regional Wall Thicknesses via Residual Recurrent Neural Network
Authors:
Wufeng Xue,
Ilanit Ben Nachum,
Sachin Pandey,
James Warrington,
Stephanie Leung,
Shuo Li
Abstract:
Accurate estimation of regional wall thicknesses (RWT) of left ventricular (LV) myocardium from cardiac MR sequences is of significant importance for identification and diagnosis of cardiac disease. Existing RWT estimation still relies on segmentation of LV myocardium, which requires strong prior information and user interaction. No work has been devoted to direct estimation of RWT from cardiac MR images, due to the diverse shapes and structures across subjects and cardiac diseases, as well as the complex regional deformation of LV myocardium during the systole and diastole phases of the cardiac cycle. In this paper, we present a newly proposed Residual Recurrent Neural Network (ResRNN) that fully leverages the spatial and temporal dynamics of LV myocardium to achieve accurate frame-wise RWT estimation. Our ResRNN comprises two paths: 1) a feed-forward convolutional neural network (CNN) for effective and robust CNN embedding learning of various cardiac images and preliminary estimation of RWT from each frame independently, and 2) a recurrent neural network (RNN) for further improving the estimation by modeling spatial and temporal dynamics of LV myocardium. For the RNN path, we design for cardiac sequences a Circle-RNN to eliminate the effect of null hidden input for the first time-step. Our ResRNN is capable of obtaining accurate estimation of cardiac RWT with a Mean Absolute Error of 1.44 mm (less than 1-pixel error) when validated on cardiac MR sequences of 145 subjects, evidencing its great potential in clinical cardiac function assessment.
Submitted 26 May, 2017;
originally announced May 2017.