
Bandit-Driven Batch Selection for Robust Learning under Label Noise

Authors: Michal Lisicki, Mihai Nica, Graham W. Taylor

Abstract: We introduce a novel approach for batch selection in Stochastic Gradient Descent (SGD) training, leveraging combinatorial bandit algorithms. Our methodology focuses on optimizing the learning process in the presence of label noise, a prevalent issue in real-world datasets. Experimental evaluations on the CIFAR-10 dataset reveal that our approach consistently outperforms existing methods across various levels of label corruption. Importantly, we achieve this superior performance without incurring the computational overhead commonly associated with auxiliary neural network models. This work presents a balanced trade-off between computational efficiency and model efficacy, offering a scalable solution for complex machine learning applications.

Submitted 31 October, 2023; originally announced November 2023.

Comments: WANT@NeurIPS 2023 & OPT@NeurIPS 2023

arXiv:2310.12079

Differential Equation Scaling Limits of Shaped and Unshaped Neural Networks

Authors: Mufan Bill Li, Mihai Nica

Abstract: Recent analyses of neural networks with shaped activations (i.e. the activation function is scaled as the network size grows) have led to scaling limits described by differential equations. However, these results do not a priori tell us anything about "ordinary" unshaped networks, where the activation is unchanged as the network size grows. In this article, we find a similar differential-equation-based asymptotic characterization for two types of unshaped networks. Firstly, we show that the following two architectures converge to the same infinite-depth-and-width limit at initialization: (i) a fully connected ResNet with a $d^{-1/2}$ factor on the residual branch, where $d$ is the network depth; (ii) a multilayer perceptron (MLP) with depth $d \ll$ width $n$ and shaped ReLU activation at rate $d^{-1/2}$. Secondly, for an unshaped MLP at initialization, we derive the first order asymptotic correction to the layerwise correlation. In particular, if $\rho_\ell$ is the correlation at layer $\ell$, then $q_t = \ell^2 (1 - \rho_\ell)$ with $t = \frac{\ell}{n}$ converges to an SDE with a singularity at $t=0$. These results together provide a connection between shaped and unshaped network architectures, and open up the possibility of studying the effect of normalization methods and how it connects with shaping activation functions.

Submitted 18 April, 2024; v1 submitted 18 October, 2023; originally announced October 2023.

arXiv:2309.02530

Diffusion on the Probability Simplex

Authors: Griffin Floto, Thorsteinn Jonsson, Mihai Nica, Scott Sanner, Eric Zhengyu Zhu

Abstract: Diffusion models learn to reverse the progressive noising of a data distribution to create a generative model. However, the desired continuous nature of the noising process can be at odds with discrete data. To deal with this tension between continuous and discrete objects, we propose a method of performing diffusion on the probability simplex. Using the probability simplex naturally creates an interpretation where points correspond to categorical probability distributions. Our method uses the softmax function applied to an Ornstein-Uhlenbeck process, a well-known stochastic differential equation. We find that our methodology also naturally extends to include diffusion on the unit cube, which has applications for bounded image generation.

Submitted 11 September, 2023; v1 submitted 5 September, 2023; originally announced September 2023.
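
The abstract above describes applying the softmax function to an Ornstein-Uhlenbeck process. A minimal sketch of that forward noising idea follows, using only the Python standard library. The function names, the parameters `theta`, `sigma`, `dt`, and the starting point are illustrative assumptions and are not taken from the paper, whose exact parameterization is not given in the abstract.

```python
import math
import random


def softmax(x):
    """Map a real vector to a point on the probability simplex."""
    m = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def ou_step(x, dt, theta, sigma, rng):
    """Exact one-step transition of dX = -theta * X dt + sigma dW, per coordinate."""
    decay = math.exp(-theta * dt)
    std = sigma * math.sqrt((1.0 - decay ** 2) / (2.0 * theta))
    return [decay * v + std * rng.gauss(0.0, 1.0) for v in x]


def noise_on_simplex(x0, n_steps=100, dt=0.05, theta=1.0, sigma=1.0, seed=0):
    """Run an OU process in R^k and record its softmax image at every step.

    All default values here are assumptions chosen for illustration.
    """
    rng = random.Random(seed)
    x = list(x0)
    path = [softmax(x)]
    for _ in range(n_steps):
        x = ou_step(x, dt, theta, sigma, rng)
        path.append(softmax(x))
    return path


# Start near a vertex of the simplex (a nearly one-hot categorical distribution).
path = noise_on_simplex([4.0, 0.0, 0.0])
```

Every entry of `path` is a categorical distribution, so the noised trajectory never leaves the simplex; this is the interpretation the abstract points to, shown here only for the forward process and not for the learned reverse process.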

arXiv:2306.01513

Network Degeneracy as an Indicator of Training Performance: Comparing Finite and Infinite Width Angle Predictions

Authors: Cameron Jakub, Mihai Nica

Abstract: Neural networks are powerful functions with widespread use, but the theoretical behaviour of these functions is not fully understood. Creating deep neural networks by stacking many layers has achieved exceptional performance in many applications and contributed to the recent explosion of these methods. Previous works have shown that depth can exponentially increase the expressibility of the network. However, as networks get deeper and deeper, they are more susceptible to becoming degenerate. We observe this degeneracy in the sense that on initialization, inputs tend to become more and more correlated as they travel through the layers of the network. If a network has too many layers, it tends to approximate a (random) constant function, making it effectively incapable of distinguishing between inputs. This seems to affect the training of the network and cause it to perform poorly, as we empirically investigate in this paper. We use a simple algorithm that can accurately predict the level of degeneracy for any given fully connected ReLU network architecture, and demonstrate how the predicted degeneracy relates to training dynamics of the network. We also compare this prediction to predictions derived using infinite width networks.

Submitted 2 June, 2023; originally announced June 2023.

Comments: 5 pages, comments welcome

arXiv:2305.02299

Dynamic Sparse Training with Structured Sparsity

Authors: Mike Lasby, Anna Golubeva, Utku Evci, Mihai Nica, Yani Ioannou

Abstract: Dynamic Sparse Training (DST) methods achieve state-of-the-art results in sparse neural network training, matching the generalization of dense models while enabling sparse training and inference. Although the resulting models are highly sparse and theoretically less computationally expensive, achieving speedups with unstructured sparsity on real-world hardware is challenging. In this work, we propose a sparse-to-sparse DST method, Structured RigL (SRigL), to learn a variant of fine-grained structured N:M sparsity by imposing a constant fan-in constraint. Using our empirical analysis of existing DST methods at high sparsity, we additionally employ a neuron ablation method which enables SRigL to achieve state-of-the-art sparse-to-sparse structured DST performance on a variety of Neural Network (NN) architectures. Using a 90% sparse linear layer, we demonstrate a real-world acceleration of 3.4x/2.5x on CPU for online inference and 1.7x/13.0x on GPU for inference with a batch size of 256 when compared to equivalent dense/unstructured (CSR) sparse layers, respectively.

Submitted 21 February, 2024; v1 submitted 3 May, 2023; originally announced May 2023.

Comments: ICLR 2024, 29 pages, 22 figures

arXiv:2302.09712

Depth Degeneracy in Neural Networks: Vanishing Angles in Fully Connected ReLU Networks on Initialization

Authors: Cameron Jakub, Mihai Nica

Abstract: Despite remarkable performance on a variety of tasks, many properties of deep neural networks are not yet theoretically understood. One such mystery is the depth degeneracy phenomenon: the deeper you make your network, the closer your network is to a constant function on initialization. In this paper, we examine the evolution of the angle between two inputs to a ReLU neural network as a function of the number of layers. By using combinatorial expansions, we find precise formulas for how fast this angle goes to zero as depth increases. These formulas capture microscopic fluctuations that are not visible in the popular framework of infinite width limits, and lead to qualitatively different predictions. We validate our theoretical results with Monte Carlo experiments and show that our results accurately approximate finite network behaviour. The formulas are given in terms of the mixed moments of correlated Gaussians passed through the ReLU function. We also find a surprising combinatorial connection between these mixed moments and the Bessel numbers that allows us to explicitly evaluate these moments.

Submitted 26 May, 2023; v1 submitted 19 February, 2023; originally announced February 2023.

Comments: Minor updates and exposition improved. 37 pages, comments welcome
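
The depth degeneracy described in the abstract above can be observed numerically. The following is a minimal Monte Carlo sketch, not the paper's combinatorial formulas: it pushes two orthogonal inputs through one randomly initialized fully connected ReLU network and records the angle between them after each layer. The width, depth, seed, and the He-style weight variance `2 / width` are assumptions made for illustration.

```python
import math
import random


def angle(u, v):
    """Angle in radians between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard against rounding
    return math.acos(cosine)


def relu_layer(weights, x):
    """Apply one bias-free fully connected layer followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def angles_through_depth(width=64, depth=10, seed=0):
    """Track the angle between two inputs layer by layer at initialization."""
    rng = random.Random(seed)
    # Two orthogonal inputs with disjoint supports (initial angle is pi / 2).
    x = [1.0 if i % 2 == 0 else 0.0 for i in range(width)]
    y = [0.0 if i % 2 == 0 else 1.0 for i in range(width)]
    angles = [angle(x, y)]
    std = math.sqrt(2.0 / width)  # assumed He-style initialization
    for _ in range(depth):
        weights = [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        x, y = relu_layer(weights, x), relu_layer(weights, y)
        angles.append(angle(x, y))
    return angles


angles = angles_through_depth()
```

Printing `angles` shows the angle shrinking as depth grows for this single random draw; averaging over many seeds would be needed to compare against any theoretical prediction.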

arXiv:2207.09408

Bounding generalization error with input compression: An empirical study with infinite-width networks

Authors: Angus Galloway, Anna Golubeva, Mahmoud Salem, Mihai Nica, Yani Ioannou, Graham W. Taylor

Abstract: Estimating the Generalization Error (GE) of Deep Neural Networks (DNNs) is an important task that often relies on availability of held-out data. The ability to better predict GE based on a single training set may yield overarching DNN design principles to reduce a reliance on trial-and-error, along with other performance assessment advantages. In search of a quantity relevant to GE, we investigate the Mutual Information (MI) between the input and final layer representations, using the infinite-width DNN limit to bound MI. An existing input compression-based GE bound is used to link MI and GE. To the best of our knowledge, this represents the first empirical study of this bound. In our attempt to empirically falsify the theoretical bound, we find that it is often tight for best-performing models. Furthermore, it detects randomization of training labels in many cases, reflects test-time perturbation robustness, and works well given only few training samples. These results are promising given that input compression is broadly applicable where MI can be estimated with confidence.

Submitted 19 July, 2022; originally announced July 2022.

Comments: 12 pages main content, 26 pages total

arXiv:2206.02768

The Neural Covariance SDE: Shaped Infinite Depth-and-Width Networks at Initialization

Authors: Mufan Bill Li, Mihai Nica, Daniel M. Roy

Abstract: The logit outputs of a feedforward neural network at initialization are conditionally Gaussian, given a random covariance matrix defined by the penultimate layer. In this work, we study the distribution of this random matrix. Recent work has shown that shaping the activation function as network depth grows large is necessary for this covariance matrix to be non-degenerate. However, the current infinite-width-style understanding of this shaping method is unsatisfactory for large depth: infinite-width analyses ignore the microscopic fluctuations from layer to layer, but these fluctuations accumulate over many layers. To overcome this shortcoming, we study the random covariance matrix in the shaped infinite-depth-and-width limit. We identify the precise scaling of the activation function necessary to arrive at a non-trivial limit, and show that the random covariance matrix is governed by a stochastic differential equation (SDE) that we call the Neural Covariance SDE. Using simulations, we show that the SDE closely matches the distribution of the random covariance matrix of finite networks. Additionally, we recover an if-and-only-if condition for exploding and vanishing norms of large shaped networks based on the activation function.

Submitted 14 June, 2023; v1 submitted 6 June, 2022; originally announced June 2022.

Comments: 48 pages, 10 figures. Advances in Neural Information Processing Systems (2022)

arXiv:2111.15646

The Exponentially Tilted Gaussian Prior for Variational Autoencoders

Authors: Griffin Floto, Stefan Kremer, Mihai Nica

Abstract: An important property for deep neural networks is the ability to perform robust out-of-distribution detection on previously unseen data. This property is essential for safety purposes when deploying models for real world applications. Recent studies show that probabilistic generative models can perform poorly on this task, which is surprising given that they seek to estimate the likelihood of training data. To alleviate this issue, we propose the exponentially tilted Gaussian prior distribution for the Variational Autoencoder (VAE) which pulls points onto the surface of a hyper-sphere in latent space. This achieves state-of-the-art results on the area under the curve-receiver operator characteristics metric using just the log-likelihood that the VAE naturally assigns. Because this prior is a simple modification of the traditional VAE prior, it is faster and easier to implement than competitive methods.

Submitted 12 April, 2022; v1 submitted 30 November, 2021; originally announced November 2021.

arXiv:2110.08664

Finding Critical Scenarios for Automated Driving Systems: A Systematic Literature Review

Authors: Xinhai Zhang, Jianbo Tao, Kaige Tan, Martin Törngren, José Manuel Gaspar Sánchez, Muhammad Rusyadi Ramli, Xin Tao, Magnus Gyllenhammar, Franz Wotawa, Naveen Mohan, Mihai Nica, Hermann Felbinger

Abstract: Scenario-based approaches have been receiving a huge amount of attention in research and engineering of automated driving systems. Due to the complexity and uncertainty of the driving environment, and the complexity of the driving task itself, the number of possible driving scenarios that an ADS or ADAS may encounter is virtually infinite. Therefore it is essential to be able to reason about the identification of scenarios and in particular critical ones that may impose unacceptable risk if not considered. Critical scenarios are particularly important to support design, verification and validation efforts, and as a basis for a safety case. In this paper, we present the results of a systematic literature review in the context of autonomous driving. The main contributions are: (i) introducing a comprehensive taxonomy for critical scenario identification methods; (ii) giving an overview of the state-of-the-art research based on the taxonomy encompassing 86 papers between 2017 and 2020; and (iii) identifying open issues and directions for further research. The provided taxonomy comprises three main perspectives encompassing the problem definition (the why), the solution (the methods to derive scenarios), and the assessment of the established scenarios. In addition, we discuss open research issues considering the perspectives of coverage, practicability, and scenario space explosion.

Submitted 16 October, 2021; originally announced October 2021.

Comments: 37 pages, 24 figures

arXiv:2106.04013

The Future is Log-Gaussian: ResNets and Their Infinite-Depth-and-Width Limit at Initialization

Authors: Mufan Bill Li, Mihai Nica, Daniel M. Roy

Abstract: Theoretical results show that neural networks can be approximated by Gaussian processes in the infinite-width limit. However, for fully connected networks, it has been previously shown that for any fixed network width, $n$, the Gaussian approximation gets worse as the network depth, $d$, increases. Given that modern networks are deep, this raises the question of how well modern architectures, like ResNets, are captured by the infinite-width limit. To provide a better approximation, we study ReLU ResNets in the infinite-depth-and-width limit, where both depth and width tend to infinity as their ratio, $d/n$, remains constant. In contrast to the Gaussian infinite-width limit, we show theoretically that the network exhibits log-Gaussian behaviour at initialization in the infinite-depth-and-width limit, with parameters depending on the ratio $d/n$. Using Monte Carlo simulations, we demonstrate that even basic properties of standard ResNet architectures are poorly captured by the Gaussian limit, but remarkably well captured by our log-Gaussian limit. Moreover, our analysis reveals that ReLU ResNets at initialization are hypoactivated: fewer than half of the ReLUs are activated. Additionally, we calculate the interlayer correlations, which have the effect of exponentially increasing the variance of the network output. Based on our analysis, we introduce Balanced ResNets, a simple architecture modification, which eliminates hypoactivation and interlayer correlations and is more amenable to theoretical analysis.

Submitted 27 October, 2021; v1 submitted 7 June, 2021; originally announced June 2021.

arXiv:2001.06145

doi 10.1016/j.jcp.2020.109672

A Derivative-Free Method for Solving Elliptic Partial Differential Equations with Deep Neural Networks

Authors: Jihun Han, Mihai Nica, Adam R Stinchcombe

Abstract: We introduce a deep neural network based method for solving a class of elliptic partial differential equations. We approximate the solution of the PDE with a deep neural network which is trained under the guidance of a probabilistic representation of the PDE in the spirit of the Feynman-Kac formula. The solution is given by an expectation of a martingale process driven by a Brownian motion. As Brownian walkers explore the domain, the deep neural network is iteratively trained using a form of reinforcement learning. Our method is a 'Derivative-Free Loss Method' since it does not require the explicit calculation of the derivatives of the neural network with respect to the input neurons in order to compute the training loss. The advantages of our method are showcased in a series of test problems: a corner singularity problem, an interface problem, and an application to a chemotaxis population model.

Submitted 16 January, 2020; originally announced January 2020.

Comments: 25 pages, 4 figures

arXiv:1909.05989

Finite Depth and Width Corrections to the Neural Tangent Kernel

Authors: Boris Hanin, Mihai Nica

Abstract: We prove the precise scaling, at finite depth and width, for the mean and variance of the neural tangent kernel (NTK) in a randomly initialized ReLU network. The standard deviation is exponential in the ratio of network depth to width. Thus, even in the limit of infinite overparameterization, the NTK is not deterministic if depth and width simultaneously tend to infinity. Moreover, we prove that for such deep and wide networks, the NTK has a non-trivial evolution during training by showing that the mean of its first SGD update is also exponential in the ratio of network depth to width. This is in sharp contrast to the regime where depth is fixed and network width is very large. Our results suggest that, unlike relatively shallow and wide networks, deep and wide ReLU networks are capable of learning data-dependent features even in the so-called lazy training regime.

Submitted 12 September, 2019; originally announced September 2019.

Comments: 27 pages, 2 figures, comments welcome

arXiv:1509.03327

doi 10.1017/S026996481600022X

Optimal Strategy in "Guess Who?": Beyond Binary Search

Authors: Mihai Nica

Abstract: "Guess Who?" is a popular two player game where players ask "Yes"/"No" questions to search for their opponent's secret identity from a pool of possible candidates. This is modeled as a simple stochastic game. Using this model, the optimal strategy is explicitly found. Contrary to popular belief, performing a binary search is not always optimal. Instead, the optimal strategy for the player who trails is to make certain bold plays in an attempt to catch up. This is discovered by first analyzing a continuous version of the game where players play indefinitely and the winner is never decided after finitely many rounds.

Submitted 15 January, 2016; v1 submitted 8 September, 2015; originally announced September 2015.

Comments: 13 pages, 2 figures. Derivation rewritten from the point of view of "Continuous Guess Who?". To appear in Probability in the Engineering and Informational Sciences

MSC Class: 91A15; 60G40; 62L15; 91A60

Journal ref: Prob. Eng. Inf. Sci. 30 (2016) 576-592

Showing 1–14 of 14 results for author: Nica, M