-
Generative modeling through internal high-dimensional chaotic activity
Authors:
Samantha J. Fournier,
Pierfrancesco Urbani
Abstract:
Generative modeling aims at producing new datapoints whose statistical properties resemble the ones in a training dataset. In recent years, there has been a burst of machine learning techniques and settings that can achieve this goal with remarkable performances. In most of these settings, one uses the training dataset in conjunction with noise, which is added as a source of statistical variabilit…
▽ More
Generative modeling aims at producing new datapoints whose statistical properties resemble the ones in a training dataset. In recent years, there has been a burst of machine learning techniques and settings that can achieve this goal with remarkable performances. In most of these settings, one uses the training dataset in conjunction with noise, which is added as a source of statistical variability and is essential for the generative task. Here, we explore the idea of using internal chaotic dynamics in high-dimensional chaotic systems as a way to generate new datapoints from a training dataset. We show that simple learning rules can achieve this goal within a set of vanilla architectures and characterize the quality of the generated datapoints through standard accuracy measures.
△ Less
Submitted 17 May, 2024;
originally announced May 2024.
-
Stochastic Gradient Descent outperforms Gradient Descent in recovering a high-dimensional signal in a glassy energy landscape
Authors:
Persia Jana Kamali,
Pierfrancesco Urbani
Abstract:
Stochastic Gradient Descent (SGD) is an out-of-equilibrium algorithm used extensively to train artificial neural networks. However very little is known on to what extent SGD is crucial for to the success of this technology and, in particular, how much it is effective in optimizing high-dimensional non-convex cost functions as compared to other optimization algorithms such as Gradient Descent (GD).…
▽ More
Stochastic Gradient Descent (SGD) is an out-of-equilibrium algorithm used extensively to train artificial neural networks. However very little is known on to what extent SGD is crucial for to the success of this technology and, in particular, how much it is effective in optimizing high-dimensional non-convex cost functions as compared to other optimization algorithms such as Gradient Descent (GD). In this work we leverage dynamical mean field theory to benchmark its performances in the high-dimensional limit. To do that, we consider the problem of recovering a hidden high-dimensional non-linearly encrypted signal, a prototype high-dimensional non-convex hard optimization problem. We compare the performances of SGD to GD and we show that SGD largely outperforms GD for sufficiently small batch sizes. In particular, a power law fit of the relaxation time of these algorithms shows that the recovery threshold for SGD with small batch size is smaller than the corresponding one of GD.
△ Less
Submitted 18 December, 2023; v1 submitted 9 September, 2023;
originally announced September 2023.
-
The effective noise of Stochastic Gradient Descent
Authors:
Francesca Mignacco,
Pierfrancesco Urbani
Abstract:
Stochastic Gradient Descent (SGD) is the workhorse algorithm of deep learning technology. At each step of the training phase, a mini batch of samples is drawn from the training dataset and the weights of the neural network are adjusted according to the performance on this specific subset of examples. The mini-batch sampling procedure introduces a stochastic dynamics to the gradient descent, with a…
▽ More
Stochastic Gradient Descent (SGD) is the workhorse algorithm of deep learning technology. At each step of the training phase, a mini batch of samples is drawn from the training dataset and the weights of the neural network are adjusted according to the performance on this specific subset of examples. The mini-batch sampling procedure introduces a stochastic dynamics to the gradient descent, with a non-trivial state-dependent noise. We characterize the stochasticity of SGD and a recently-introduced variant, \emph{persistent} SGD, in a prototypical neural network model. In the under-parametrized regime, where the final training error is positive, the SGD dynamics reaches a stationary state and we define an effective temperature from the fluctuation-dissipation theorem, computed from dynamical mean-field theory. We use the effective temperature to quantify the magnitude of the SGD noise as a function of the problem parameters. In the over-parametrized regime, where the training error vanishes, we measure the noise magnitude of SGD by computing the average distance between two replicas of the system with the same initialization and two different realizations of SGD noise. We find that the two noise measures behave similarly as a function of the problem parameters. Moreover, we observe that noisier algorithms lead to wider decision boundaries of the corresponding constraint satisfaction problem.
△ Less
Submitted 1 June, 2022; v1 submitted 20 December, 2021;
originally announced December 2021.
-
Stochasticity helps to navigate rough landscapes: comparing gradient-descent-based algorithms in the phase retrieval problem
Authors:
Francesca Mignacco,
Pierfrancesco Urbani,
Lenka Zdeborová
Abstract:
In this paper we investigate how gradient-based algorithms such as gradient descent, (multi-pass) stochastic gradient descent, its persistent variant, and the Langevin algorithm navigate non-convex loss-landscapes and which of them is able to reach the best generalization error at limited sample complexity. We consider the loss landscape of the high-dimensional phase retrieval problem as a prototy…
▽ More
In this paper we investigate how gradient-based algorithms such as gradient descent, (multi-pass) stochastic gradient descent, its persistent variant, and the Langevin algorithm navigate non-convex loss-landscapes and which of them is able to reach the best generalization error at limited sample complexity. We consider the loss landscape of the high-dimensional phase retrieval problem as a prototypical highly non-convex example. We observe that for phase retrieval the stochastic variants of gradient descent are able to reach perfect generalization for regions of control parameters where the gradient descent algorithm is not. We apply dynamical mean-field theory from statistical physics to characterize analytically the full trajectories of these algorithms in their continuous-time limit, with a warm start, and for large system sizes. We further unveil several intriguing properties of the landscape and the algorithms such as that the gradient descent can obtain better generalization properties from less informed initializations.
△ Less
Submitted 13 April, 2021; v1 submitted 8 March, 2021;
originally announced March 2021.
-
Analytical Study of Momentum-Based Acceleration Methods in Paradigmatic High-Dimensional Non-Convex Problems
Authors:
Stefano Sarao Mannelli,
Pierfrancesco Urbani
Abstract:
The optimization step in many machine learning problems rarely relies on vanilla gradient descent but it is common practice to use momentum-based accelerated methods. Despite these algorithms being widely applied to arbitrary loss functions, their behaviour in generically non-convex, high dimensional landscapes is poorly understood. In this work, we use dynamical mean field theory techniques to de…
▽ More
The optimization step in many machine learning problems rarely relies on vanilla gradient descent but it is common practice to use momentum-based accelerated methods. Despite these algorithms being widely applied to arbitrary loss functions, their behaviour in generically non-convex, high dimensional landscapes is poorly understood. In this work, we use dynamical mean field theory techniques to describe analytically the average dynamics of these methods in a prototypical non-convex model: the (spiked) matrix-tensor model. We derive a closed set of equations that describe the behaviour of heavy-ball momentum and Nesterov acceleration in the infinite dimensional limit. By numerical integration of these equations, we observe that these methods speed up the dynamics but do not improve the algorithmic threshold with respect to gradient descent in the spiked model.
△ Less
Submitted 27 October, 2021; v1 submitted 23 February, 2021;
originally announced February 2021.
-
Complex Dynamics in Simple Neural Networks: Understanding Gradient Flow in Phase Retrieval
Authors:
Stefano Sarao Mannelli,
Giulio Biroli,
Chiara Cammarota,
Florent Krzakala,
Pierfrancesco Urbani,
Lenka Zdeborová
Abstract:
Despite the widespread use of gradient-based algorithms for optimizing high-dimensional non-convex functions, understanding their ability of finding good minima instead of being trapped in spurious ones remains to a large extent an open problem. Here we focus on gradient flow dynamics for phase retrieval from random measurements. When the ratio of the number of measurements over the input dimensio…
▽ More
Despite the widespread use of gradient-based algorithms for optimizing high-dimensional non-convex functions, understanding their ability of finding good minima instead of being trapped in spurious ones remains to a large extent an open problem. Here we focus on gradient flow dynamics for phase retrieval from random measurements. When the ratio of the number of measurements over the input dimension is small the dynamics remains trapped in spurious minima with large basins of attraction. We find analytically that above a critical ratio those critical points become unstable develo** a negative direction toward the signal. By numerical experiments we show that in this regime the gradient flow algorithm is not trapped; it drifts away from the spurious critical points along the unstable direction and succeeds in finding the global minimum. Using tools from statistical physics we characterize this phenomenon, which is related to a BBP-type transition in the Hessian of the spurious minima.
△ Less
Submitted 12 June, 2020;
originally announced June 2020.
-
Dynamical mean-field theory for stochastic gradient descent in Gaussian mixture classification
Authors:
Francesca Mignacco,
Florent Krzakala,
Pierfrancesco Urbani,
Lenka Zdeborová
Abstract:
We analyze in a closed form the learning dynamics of stochastic gradient descent (SGD) for a single-layer neural network classifying a high-dimensional Gaussian mixture where each cluster is assigned one of two labels. This problem provides a prototype of a non-convex loss landscape with interpolating regimes and a large generalization gap. We define a particular stochastic process for which SGD c…
▽ More
We analyze in a closed form the learning dynamics of stochastic gradient descent (SGD) for a single-layer neural network classifying a high-dimensional Gaussian mixture where each cluster is assigned one of two labels. This problem provides a prototype of a non-convex loss landscape with interpolating regimes and a large generalization gap. We define a particular stochastic process for which SGD can be extended to a continuous-time limit that we call stochastic gradient flow. In the full-batch limit, we recover the standard gradient flow. We apply dynamical mean-field theory from statistical physics to track the dynamics of the algorithm in the high-dimensional limit via a self-consistent stochastic process. We explore the performance of the algorithm as a function of the control parameters shedding light on how it navigates the loss landscape.
△ Less
Submitted 9 November, 2021; v1 submitted 10 June, 2020;
originally announced June 2020.
-
Passed & Spurious: Descent Algorithms and Local Minima in Spiked Matrix-Tensor Models
Authors:
Stefano Sarao Mannelli,
Florent Krzakala,
Pierfrancesco Urbani,
Lenka Zdeborová
Abstract:
In this work we analyse quantitatively the interplay between the loss landscape and performance of descent algorithms in a prototypical inference problem, the spiked matrix-tensor model. We study a loss function that is the negative log-likelihood of the model. We analyse the number of local minima at a fixed distance from the signal/spike with the Kac-Rice formula, and locate trivialization of th…
▽ More
In this work we analyse quantitatively the interplay between the loss landscape and performance of descent algorithms in a prototypical inference problem, the spiked matrix-tensor model. We study a loss function that is the negative log-likelihood of the model. We analyse the number of local minima at a fixed distance from the signal/spike with the Kac-Rice formula, and locate trivialization of the landscape at large signal-to-noise ratios. We evaluate in a closed form the performance of a gradient flow algorithm using integro-differential PDEs as developed in physics of disordered systems for the Langevin dynamics. We analyze the performance of an approximate message passing algorithm estimating the maximum likelihood configuration via its state evolution. We conclude by comparing the above results: while we observe a drastic slow down of the gradient flow dynamics even in the region where the landscape is trivial, both the analyzed algorithms are shown to perform well even in the part of the region of parameters where spurious local minima are present.
△ Less
Submitted 20 January, 2020; v1 submitted 31 January, 2019;
originally announced February 2019.
-
Marvels and Pitfalls of the Langevin Algorithm in Noisy High-dimensional Inference
Authors:
Stefano Sarao Mannelli,
Giulio Biroli,
Chiara Cammarota,
Florent Krzakala,
Pierfrancesco Urbani,
Lenka Zdeborová
Abstract:
Gradient-descent-based algorithms and their stochastic versions have widespread applications in machine learning and statistical inference. In this work we perform an analytic study of the performances of one of them, the Langevin algorithm, in the context of noisy high-dimensional inference. We employ the Langevin algorithm to sample the posterior probability measure for the spiked matrix-tensor…
▽ More
Gradient-descent-based algorithms and their stochastic versions have widespread applications in machine learning and statistical inference. In this work we perform an analytic study of the performances of one of them, the Langevin algorithm, in the context of noisy high-dimensional inference. We employ the Langevin algorithm to sample the posterior probability measure for the spiked matrix-tensor model. The typical behaviour of this algorithm is described by a system of integro-differential equations that we call the Langevin state evolution, whose solution is compared with the one of the state evolution of approximate message passing (AMP). Our results show that, remarkably, the algorithmic threshold of the Langevin algorithm is sub-optimal with respect to the one given by AMP. We conjecture this phenomenon to be due to the residual glassiness present in that region of parameters. Finally we show how a landscape-annealing protocol, that uses the Langevin algorithm but violate the Bayes-optimality condition, can approach the performance of AMP.
△ Less
Submitted 13 January, 2020; v1 submitted 21 December, 2018;
originally announced December 2018.
-
Approximate Survey Propagation for Statistical Inference
Authors:
Fabrizio Antenucci,
Florent Krzakala,
Pierfrancesco Urbani,
Lenka Zdeborová
Abstract:
Approximate message passing algorithm enjoyed considerable attention in the last decade. In this paper we introduce a variant of the AMP algorithm that takes into account glassy nature of the system under consideration. We coin this algorithm as the approximate survey propagation (ASP) and derive it for a class of low-rank matrix estimation problems. We derive the state evolution for the ASP algor…
▽ More
Approximate message passing algorithm enjoyed considerable attention in the last decade. In this paper we introduce a variant of the AMP algorithm that takes into account glassy nature of the system under consideration. We coin this algorithm as the approximate survey propagation (ASP) and derive it for a class of low-rank matrix estimation problems. We derive the state evolution for the ASP algorithm and prove that it reproduces the one-step replica symmetry breaking (1RSB) fixed-point equations, well-known in physics of disordered systems. Our derivation thus gives a concrete algorithmic meaning to the 1RSB equations that is of independent interest. We characterize the performance of ASP in terms of convergence and mean-squared error as a function of the free Parisi parameter s. We conclude that when there is a model mismatch between the true generative model and the inference model, the performance of AMP rapidly degrades both in terms of MSE and of convergence, while ASP converges in a larger regime and can reach lower errors. Among other results, our analysis leads us to a striking hypothesis that whenever s (or other parameters) can be set in such a way that the Nishimori condition $M=Q>0$ is restored, then the corresponding algorithm is able to reach mean-squared error as low as the Bayes-optimal error obtained when the model and its parameters are known and exactly matched in the inference procedure.
△ Less
Submitted 3 July, 2018;
originally announced July 2018.
-
Glassy nature of the hard phase in inference problems
Authors:
Fabrizio Antenucci,
Silvio Franz,
Pierfrancesco Urbani,
Lenka Zdeborová
Abstract:
An algorithmically hard phase was described in a range of inference problems: even if the signal can be reconstructed with a small error from an information theoretic point of view, known algorithms fail unless the noise-to-signal ratio is sufficiently small. This hard phase is typically understood as a metastable branch of the dynamical evolution of message passing algorithms. In this work we stu…
▽ More
An algorithmically hard phase was described in a range of inference problems: even if the signal can be reconstructed with a small error from an information theoretic point of view, known algorithms fail unless the noise-to-signal ratio is sufficiently small. This hard phase is typically understood as a metastable branch of the dynamical evolution of message passing algorithms. In this work we study the metastable branch for a prototypical inference problem, the low-rank matrix factorization, that presents a hard phase. We show that for noise-to-signal ratios that are below the information theoretic threshold, the posterior measure is composed of an exponential number of metastable glassy states and we compute their entropy, called the complexity. We show that this glassiness extends even slightly below the algorithmic threshold below which the well-known approximate message passing (AMP) algorithm is able to closely reconstruct the signal. Counter-intuitively, we find that the performance of the AMP algorithm is not improved by taking into account the glassy nature of the hard phase. This result provides further evidence that the hard phase in inference problems is algorithmically impenetrable for some deep computational reasons that remain to be uncovered.
△ Less
Submitted 9 January, 2019; v1 submitted 15 May, 2018;
originally announced May 2018.