Search | arXiv e-print repository

Online Learning and Information Exponents: On The Importance of Batch size, and Time/Complexity Tradeoffs

Authors: Luca Arnaboldi, Yatin Dandi, Florent Krzakala, Bruno Loureiro, Luca Pesce, Ludovic Stephan

Abstract: We study the impact of the batch size $n_b$ on the iteration time $T$ of training two-layer neural networks with one-pass stochastic gradient descent (SGD) on multi-index target functions of isotropic covariates. We characterize the optimal batch size minimizing the iteration time as a function of the hardness of the target, as characterized by the information exponents. We show that performing gr… ▽ More We study the impact of the batch size $n_b$ on the iteration time $T$ of training two-layer neural networks with one-pass stochastic gradient descent (SGD) on multi-index target functions of isotropic covariates. We characterize the optimal batch size minimizing the iteration time as a function of the hardness of the target, as characterized by the information exponents. We show that performing gradient updates with large batches $n_b \lesssim d^{\frac{\ell}{2}}$ minimizes the training time without changing the total sample complexity, where $\ell$ is the information exponent of the target to be learned \citep{arous2021online} and $d$ is the input dimension. However, larger batch sizes than $n_b \gg d^{\frac{\ell}{2}}$ are detrimental for improving the time complexity of SGD. We provably overcome this fundamental limitation via a different training protocol, \textit{Correlation loss SGD}, which suppresses the auto-correlation terms in the loss function. We show that one can track the training progress by a system of low-dimensional ordinary differential equations (ODEs). Finally, we validate our theoretical results with numerical experiments. △ Less

Submitted 4 June, 2024; originally announced June 2024.

arXiv:2405.15459 [pdf, other]

Repetita Iuvant: Data Repetition Allows SGD to Learn High-Dimensional Multi-Index Functions

Authors: Luca Arnaboldi, Yatin Dandi, Florent Krzakala, Luca Pesce, Ludovic Stephan

Abstract: Neural networks can identify low-dimensional relevant structures within high-dimensional noisy data, yet our mathematical understanding of how they do so remains scarce. Here, we investigate the training dynamics of two-layer shallow neural networks trained with gradient-based algorithms, and discuss how they learn pertinent features in multi-index models, that is target functions with low-dimensi… ▽ More Neural networks can identify low-dimensional relevant structures within high-dimensional noisy data, yet our mathematical understanding of how they do so remains scarce. Here, we investigate the training dynamics of two-layer shallow neural networks trained with gradient-based algorithms, and discuss how they learn pertinent features in multi-index models, that is target functions with low-dimensional relevant directions. In the high-dimensional regime, where the input dimension $d$ diverges, we show that a simple modification of the idealized single-pass gradient descent training scenario, where data can now be repeated or iterated upon twice, drastically improves its computational efficiency. In particular, it surpasses the limitations previously believed to be dictated by the Information and Leap exponents associated with the target function to be learned. Our results highlight the ability of networks to learn relevant structures from data alone without any pre-processing. More precisely, we show that (almost) all directions are learned with at most $O(d \log d)$ steps. Among the exceptions is a set of hard functions that includes sparse parities. In the presence of coupling between directions, however, these can be learned sequentially through a hierarchical mechanism that generalizes the notion of staircase functions. Our results are proven by a rigorous study of the evolution of the relevant statistics for high-dimensional dynamics. △ Less

Submitted 24 May, 2024; originally announced May 2024.

arXiv:2402.04980 [pdf, other]

Asymptotics of feature learning in two-layer networks after one gradient-step

Authors: Hugo Cui, Luca Pesce, Yatin Dandi, Florent Krzakala, Yue M. Lu, Lenka Zdeborová, Bruno Loureiro

Abstract: In this manuscript, we investigate the problem of how two-layer neural networks learn features from data, and improve over the kernel regime, after being trained with a single gradient descent step. Leveraging the insight from (Ba et al., 2022), we model the trained network by a spiked Random Features (sRF) model. Further building on recent progress on Gaussian universality (Dandi et al., 2023), w… ▽ More In this manuscript, we investigate the problem of how two-layer neural networks learn features from data, and improve over the kernel regime, after being trained with a single gradient descent step. Leveraging the insight from (Ba et al., 2022), we model the trained network by a spiked Random Features (sRF) model. Further building on recent progress on Gaussian universality (Dandi et al., 2023), we provide an exact asymptotic description of the generalization error of the sRF in the high-dimensional limit where the number of samples, the width, and the input dimension grow at a proportional rate. The resulting characterization for sRFs also captures closely the learning curves of the original network model. This enables us to understand how adapting to the data is crucial for the network to efficiently learn non-linear functions in the direction of the gradient -- where at initialization it can only express linear functions in this regime. △ Less

Submitted 4 June, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

arXiv:2402.03220 [pdf, other]

The Benefits of Reusing Batches for Gradient Descent in Two-Layer Networks: Breaking the Curse of Information and Leap Exponents

Authors: Yatin Dandi, Emanuele Troiani, Luca Arnaboldi, Luca Pesce, Lenka Zdeborová, Florent Krzakala

Abstract: We investigate the training dynamics of two-layer neural networks when learning multi-index target functions. We focus on multi-pass gradient descent (GD) that reuses the batches multiple times and show that it significantly changes the conclusion about which functions are learnable compared to single-pass gradient descent. In particular, multi-pass GD with finite stepsize is found to overcome the… ▽ More We investigate the training dynamics of two-layer neural networks when learning multi-index target functions. We focus on multi-pass gradient descent (GD) that reuses the batches multiple times and show that it significantly changes the conclusion about which functions are learnable compared to single-pass gradient descent. In particular, multi-pass GD with finite stepsize is found to overcome the limitations of gradient flow and single-pass GD given by the information exponent (Ben Arous et al., 2021) and leap exponent (Abbe et al., 2023) of the target function. We show that upon re-using batches, the network achieves in just two time steps an overlap with the target subspace even for functions not satisfying the staircase property (Abbe et al., 2021). We characterize the (broad) class of functions efficiently learned in finite time. The proof of our results is based on the analysis of the Dynamical Mean-Field Theory (DMFT). We further provide a closed-form description of the dynamical process of the low-dimensional projections of the weights, and numerical experiments illustrating the theory. △ Less

Submitted 30 June, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

Comments: Accepted at the International Conference on Machine Learning (ICML), 2024

arXiv:2309.04877 [pdf, other]

A Gentle Introduction to Gradient-Based Optimization and Variational Inequalities for Machine Learning

Authors: Neha S. Wadia, Yatin Dandi, Michael I. Jordan

Abstract: The rapid progress in machine learning in recent years has been based on a highly productive connection to gradient-based optimization. Further progress hinges in part on a shift in focus from pattern recognition to decision-making and multi-agent problems. In these broader settings, new mathematical challenges emerge that involve equilibria and game theory instead of optima. Gradient-based method… ▽ More The rapid progress in machine learning in recent years has been based on a highly productive connection to gradient-based optimization. Further progress hinges in part on a shift in focus from pattern recognition to decision-making and multi-agent problems. In these broader settings, new mathematical challenges emerge that involve equilibria and game theory instead of optima. Gradient-based methods remain essential -- given the high dimensionality and large scale of machine-learning problems -- but simple gradient descent is no longer the point of departure for algorithm design. We provide a gentle introduction to a broader framework for gradient-based algorithms in machine learning, beginning with saddle points and monotone games, and proceeding to general variational inequalities. While we provide convergence proofs for several of the algorithms that we present, our main focus is that of providing motivation and intuition. △ Less

Submitted 26 February, 2024; v1 submitted 9 September, 2023; originally announced September 2023.

Comments: 36 pages, 7 figures; minor corrections

arXiv:2305.18270 [pdf, other]

How Two-Layer Neural Networks Learn, One (Giant) Step at a Time

Authors: Yatin Dandi, Florent Krzakala, Bruno Loureiro, Luca Pesce, Ludovic Stephan

Abstract: We investigate theoretically how the features of a two-layer neural network adapt to the structure of the target function through a few large batch gradient descent steps, leading to improvement in the approximation capacity with respect to the initialization. We compare the influence of batch size and that of multiple (but finitely many) steps. For a single gradient step, a batch of size… ▽ More We investigate theoretically how the features of a two-layer neural network adapt to the structure of the target function through a few large batch gradient descent steps, leading to improvement in the approximation capacity with respect to the initialization. We compare the influence of batch size and that of multiple (but finitely many) steps. For a single gradient step, a batch of size $n = \mathcal{O}(d)$ is both necessary and sufficient to align with the target function, although only a single direction can be learned. In contrast, $n = \mathcal{O}(d^2)$ is essential for neurons to specialize to multiple relevant directions of the target with a single gradient step. Even in this case, we show there might exist ``hard'' directions requiring $n = \mathcal{O}(d^\ell)$ samples to be learned, where $\ell$ is known as the leap index of the target. The picture drastically improves over multiple gradient steps: we show that a batch-size of $n = \mathcal{O}(d)$ is indeed enough to learn multiple target directions satisfying a staircase property, where more and more directions can be learned over time. Finally, we discuss how these directions allows to drastically improve the approximation capacity and generalization error over the initialization, illustrating a separation of scale between the random features/lazy regime, and the feature learning regime. Our technical analysis leverages a combination of techniques related to concentration, projection-based conditioning, and Gaussian equivalence which we believe are of independent interest. By pinning down the conditions necessary for specialization and learning, our results highlight the interaction between batch size and number of iterations, and lead to a hierarchical depiction where learning performance exhibits a stairway to accuracy over time and batch size, shedding new light on how neural networks adapt to features of the data. △ Less

Submitted 15 December, 2023; v1 submitted 29 May, 2023; originally announced May 2023.

arXiv:2302.08933 [pdf, other]

Universality laws for Gaussian mixtures in generalized linear models

Authors: Yatin Dandi, Ludovic Stephan, Florent Krzakala, Bruno Loureiro, Lenka Zdeborová

Abstract: Let $(x_{i}, y_{i})_{i=1,\dots,n}$ denote independent samples from a general mixture distribution $\sum_{c\in\mathcal{C}}ρ_{c}P_{c}^{x}$, and consider the hypothesis class of generalized linear models $\hat{y} = F(Θ^{\top}x)$. In this work, we investigate the asymptotic joint statistics of the family of generalized linear estimators $(Θ_{1}, \dots, Θ_{M})$ obtained either from (a) minimizing an em… ▽ More Let $(x_{i}, y_{i})_{i=1,\dots,n}$ denote independent samples from a general mixture distribution $\sum_{c\in\mathcal{C}}ρ_{c}P_{c}^{x}$, and consider the hypothesis class of generalized linear models $\hat{y} = F(Θ^{\top}x)$. In this work, we investigate the asymptotic joint statistics of the family of generalized linear estimators $(Θ_{1}, \dots, Θ_{M})$ obtained either from (a) minimizing an empirical risk $\hat{R}_{n}(Θ;X,y)$ or (b) sampling from the associated Gibbs measure $\exp(-βn \hat{R}_{n}(Θ;X,y))$. Our main contribution is to characterize under which conditions the asymptotic joint statistics of this family depends (on a weak sense) only on the means and covariances of the class conditional features distribution $P_{c}^{x}$. In particular, this allow us to prove the universality of different quantities of interest, such as the training and generalization errors, redeeming a recent line of work in high-dimensional statistics working under the Gaussian mixture hypothesis. Finally, we discuss the applications of our results to different machine learning tasks of interest, such as ensembling and uncertainty △ Less

Submitted 17 February, 2023; originally announced February 2023.

arXiv:2204.06477 [pdf, other]

Data-heterogeneity-aware Mixing for Decentralized Learning

Authors: Yatin Dandi, Anastasia Koloskova, Martin Jaggi, Sebastian U. Stich

Abstract: Decentralized learning provides an effective framework to train machine learning models with data distributed over arbitrary communication graphs. However, most existing approaches toward decentralized learning disregard the interaction between data heterogeneity and graph topology. In this paper, we characterize the dependence of convergence on the relationship between the mixing weights of the g… ▽ More Decentralized learning provides an effective framework to train machine learning models with data distributed over arbitrary communication graphs. However, most existing approaches toward decentralized learning disregard the interaction between data heterogeneity and graph topology. In this paper, we characterize the dependence of convergence on the relationship between the mixing weights of the graph and the data heterogeneity across nodes. We propose a metric that quantifies the ability of a graph to mix the current gradients. We further prove that the metric controls the convergence rate, particularly in settings where the heterogeneity across nodes dominates the stochasticity between updates for a given node. Motivated by our analysis, we propose an approach that periodically and efficiently optimizes the metric using standard convex constrained optimization and sketching techniques. Through comprehensive experiments on standard computer vision and NLP benchmarks, we show that our approach leads to improvement in test performance for a wide range of tasks. △ Less

Submitted 13 April, 2022; originally announced April 2022.

arXiv:2111.03972 [pdf, other]

Understanding Layer-wise Contributions in Deep Neural Networks through Spectral Analysis

Authors: Yatin Dandi, Arthur Jacot

Abstract: Spectral analysis is a powerful tool, decomposing any function into simpler parts. In machine learning, Mercer's theorem generalizes this idea, providing for any kernel and input distribution a natural basis of functions of increasing frequency. More recently, several works have extended this analysis to deep neural networks through the framework of Neural Tangent Kernel. In this work, we analyze… ▽ More Spectral analysis is a powerful tool, decomposing any function into simpler parts. In machine learning, Mercer's theorem generalizes this idea, providing for any kernel and input distribution a natural basis of functions of increasing frequency. More recently, several works have extended this analysis to deep neural networks through the framework of Neural Tangent Kernel. In this work, we analyze the layer-wise spectral bias of Deep Neural Networks and relate it to the contributions of different layers in the reduction of generalization error for a given target function. We utilize the properties of Hermite polynomials and Spherical Harmonics to prove that initial layers exhibit a larger bias towards high-frequency functions defined on the unit sphere. We further provide empirical results validating our theory in high dimensional datasets for Deep Neural Networks. △ Less

Submitted 7 January, 2022; v1 submitted 6 November, 2021; originally announced November 2021.

arXiv:2106.13897 [pdf, other]

Implicit Gradient Alignment in Distributed and Federated Learning

Authors: Yatin Dandi, Luis Barba, Martin Jaggi

Abstract: A major obstacle to achieving global convergence in distributed and federated learning is the misalignment of gradients across clients, or mini-batches due to heterogeneity and stochasticity of the distributed data. In this work, we show that data heterogeneity can in fact be exploited to improve generalization performance through implicit regularization. One way to alleviate the effects of hetero… ▽ More A major obstacle to achieving global convergence in distributed and federated learning is the misalignment of gradients across clients, or mini-batches due to heterogeneity and stochasticity of the distributed data. In this work, we show that data heterogeneity can in fact be exploited to improve generalization performance through implicit regularization. One way to alleviate the effects of heterogeneity is to encourage the alignment of gradients across different clients throughout training. Our analysis reveals that this goal can be accomplished by utilizing the right optimization method that replicates the implicit regularization effect of SGD, leading to gradient alignment as well as improvements in test accuracies. Since the existence of this regularization in SGD completely relies on the sequential use of different mini-batches during training, it is inherently absent when training with large mini-batches. To obtain the generalization benefits of this regularization while increasing parallelism, we propose a novel GradAlign algorithm that induces the same implicit regularization while allowing the use of arbitrarily large batches in each update. We experimentally validate the benefits of our algorithm in different distributed and federated learning settings. △ Less

Submitted 12 December, 2021; v1 submitted 25 June, 2021; originally announced June 2021.

Comments: AAAI Conference on Artificial Intelligence, 2022

arXiv:2012.02684 [pdf, other]

Model-Agnostic Learning to Meta-Learn

Authors: Arnout Devos, Yatin Dandi

Abstract: In this paper, we propose a learning algorithm that enables a model to quickly exploit commonalities among related tasks from an unseen task distribution, before quickly adapting to specific tasks from that same distribution. We investigate how learning with different task distributions can first improve adaptability by meta-finetuning on related tasks before improving goal task generalization wit… ▽ More In this paper, we propose a learning algorithm that enables a model to quickly exploit commonalities among related tasks from an unseen task distribution, before quickly adapting to specific tasks from that same distribution. We investigate how learning with different task distributions can first improve adaptability by meta-finetuning on related tasks before improving goal task generalization with finetuning. Synthetic regression experiments validate the intuition that learning to meta-learn improves adaptability and consecutively generalization. Experiments on more complex image classification, continual regression, and reinforcement learning tasks demonstrate that learning to meta-learn generally improves task-specific adaptation. The methodology, setup, and hypotheses in this proposal were positively evaluated by peer review before conclusive experiments were carried out. △ Less

Submitted 19 July, 2021; v1 submitted 4 December, 2020; originally announced December 2020.

Comments: Published in Proceedings of Machine Learning Research, PMLR 148:155-175

arXiv:2006.08089 [pdf, other]

Generalized Adversarially Learned Inference

Authors: Yatin Dandi, Homanga Bharadhwaj, Abhishek Kumar, Piyush Rai

Abstract: Allowing effective inference of latent vectors while training GANs can greatly increase their applicability in various downstream tasks. Recent approaches, such as ALI and BiGAN frameworks, develop methods of inference of latent variables in GANs by adversarially training an image generator along with an encoder to match two joint distributions of image and latent vector pairs. We generalize these… ▽ More Allowing effective inference of latent vectors while training GANs can greatly increase their applicability in various downstream tasks. Recent approaches, such as ALI and BiGAN frameworks, develop methods of inference of latent variables in GANs by adversarially training an image generator along with an encoder to match two joint distributions of image and latent vector pairs. We generalize these approaches to incorporate multiple layers of feedback on reconstructions, self-supervision, and other forms of supervision based on prior or learned knowledge about the desired solutions. We achieve this by modifying the discriminator's objective to correctly identify more than two joint distributions of tuples of an arbitrary number of random variables consisting of images, latent vectors, and other variables generated through auxiliary tasks, such as reconstruction and inpainting or as outputs of suitable pre-trained models. We design a non-saturating maximization objective for the generator-encoder pair and prove that the resulting adversarial game corresponds to a global optimum that simultaneously matches all the distributions. Within our proposed framework, we introduce a novel set of techniques for providing self-supervised feedback to the model based on properties, such as patch-level correspondence and cycle consistency of reconstructions. Through comprehensive experiments, we demonstrate the efficacy, scalability, and flexibility of the proposed approach for a variety of tasks. △ Less

Submitted 21 December, 2020; v1 submitted 14 June, 2020; originally announced June 2020.

Comments: AAAI 2021 (accepted for publication)

arXiv:1912.07991 [pdf, other]

Jointly Trained Image and Video Generation using Residual Vectors

Authors: Yatin Dandi, Aniket Das, Soumye Singhal, Vinay P. Namboodiri, Piyush Rai

Abstract: In this work, we propose a modeling technique for jointly training image and video generation models by simultaneously learning to map latent variables with a fixed prior onto real images and interpolate over images to generate videos. The proposed approach models the variations in representations using residual vectors encoding the change at each time step over a summary vector for the entire vid… ▽ More In this work, we propose a modeling technique for jointly training image and video generation models by simultaneously learning to map latent variables with a fixed prior onto real images and interpolate over images to generate videos. The proposed approach models the variations in representations using residual vectors encoding the change at each time step over a summary vector for the entire video. We utilize the technique to jointly train an image generation model with a fixed prior along with a video generation model lacking constraints such as disentanglement. The joint training enables the image generator to exploit temporal information while the video generation model learns to flexibly share information across frames. Moreover, experimental results verify our approach's compatibility with pre-training on videos or images and training on datasets containing a mixture of both. A comprehensive set of quantitative and qualitative evaluations reveal the improvements in sample quality and diversity over both video generation and image generation baselines. We further demonstrate the technique's capabilities of exploiting similarity in features across frames by applying it to a model based on decomposing the video into motion and content. The proposed model allows minor variations in content across frames while maintaining the temporal dependence through latent vectors encoding the pose or motion features. △ Less

Submitted 17 December, 2019; originally announced December 2019.

Comments: Accepted in 2020 Winter Conference on Applications of Computer Vision (WACV '20)

Showing 1–13 of 13 results for author: Dandi, Y