-
Complex Dynamics in Simple Neural Networks: Understanding Gradient Flow in Phase Retrieval
Authors:
Stefano Sarao Mannelli,
Giulio Biroli,
Chiara Cammarota,
Florent Krzakala,
Pierfrancesco Urbani,
Lenka Zdeborová
Abstract:
Despite the widespread use of gradient-based algorithms for optimizing high-dimensional non-convex functions, understanding their ability of finding good minima instead of being trapped in spurious ones remains to a large extent an open problem. Here we focus on gradient flow dynamics for phase retrieval from random measurements. When the ratio of the number of measurements over the input dimension is small the dynamics remains trapped in spurious minima with large basins of attraction. We find analytically that above a critical ratio those critical points become unstable, developing a negative direction toward the signal. By numerical experiments we show that in this regime the gradient flow algorithm is not trapped; it drifts away from the spurious critical points along the unstable direction and succeeds in finding the global minimum. Using tools from statistical physics we characterize this phenomenon, which is related to a BBP-type transition in the Hessian of the spurious minima.
Submitted 12 June, 2020;
originally announced June 2020.
-
Who is Afraid of Big Bad Minima? Analysis of Gradient-Flow in a Spiked Matrix-Tensor Model
Authors:
Stefano Sarao Mannelli,
Giulio Biroli,
Chiara Cammarota,
Florent Krzakala,
Lenka Zdeborová
Abstract:
Gradient-based algorithms are effective for many machine learning tasks, but despite ample recent effort and some progress, it often remains unclear why they work in practice in optimising high-dimensional non-convex functions and why they find good minima instead of being trapped in spurious ones.
Here we present a quantitative theory explaining this behaviour in a spiked matrix-tensor model.
Our framework is based on the Kac-Rice analysis of stationary points and a closed-form analysis of gradient-flow originating from statistical physics. We show that there is a well defined region of parameters where the gradient-flow algorithm finds a good global minimum despite the presence of exponentially many spurious local minima.
We show that this is achieved by surfing on saddles that have strong negative direction towards the global minima, a phenomenon that is connected to a BBP-type threshold in the Hessian describing the critical points of the landscapes.
Submitted 20 January, 2020; v1 submitted 18 July, 2019;
originally announced July 2019.
-
How to iron out rough landscapes and get optimal performances: Averaged Gradient Descent and its application to tensor PCA
Authors:
Giulio Biroli,
Chiara Cammarota,
Federico Ricci-Tersenghi
Abstract:
In many high-dimensional estimation problems the main task consists in minimizing a cost function, which is often strongly non-convex when scanned in the space of parameters to be estimated. A standard solution to flatten the corresponding rough landscape consists in summing the losses associated to different data points and obtaining a smoother empirical risk. Here we propose a complementary method that works for a single data point. The main idea is that a large amount of the roughness is uncorrelated in different parts of the landscape. One can then substantially reduce the noise by evaluating an empirical average of the gradient obtained as a sum over many random independent positions in the space of parameters to be optimized. We present an algorithm, called Averaged Gradient Descent, based on this idea and we apply it to tensor PCA, which is a very hard estimation problem. We show that Averaged Gradient Descent outperforms physical algorithms such as gradient descent and approximate message passing and matches the best algorithmic thresholds known so far, obtained by tensor unfolding and methods based on sum-of-squares.
Submitted 6 February, 2020; v1 submitted 29 May, 2019;
originally announced May 2019.
-
Marvels and Pitfalls of the Langevin Algorithm in Noisy High-dimensional Inference
Authors:
Stefano Sarao Mannelli,
Giulio Biroli,
Chiara Cammarota,
Florent Krzakala,
Pierfrancesco Urbani,
Lenka Zdeborová
Abstract:
Gradient-descent-based algorithms and their stochastic versions have widespread applications in machine learning and statistical inference. In this work we perform an analytic study of the performance of one of them, the Langevin algorithm, in the context of noisy high-dimensional inference. We employ the Langevin algorithm to sample the posterior probability measure for the spiked matrix-tensor model. The typical behaviour of this algorithm is described by a system of integro-differential equations that we call the Langevin state evolution, whose solution is compared with that of the state evolution of approximate message passing (AMP). Our results show that, remarkably, the algorithmic threshold of the Langevin algorithm is sub-optimal with respect to the one given by AMP. We conjecture this phenomenon to be due to the residual glassiness present in that region of parameters. Finally we show how a landscape-annealing protocol, which uses the Langevin algorithm but violates the Bayes-optimality condition, can approach the performance of AMP.
Submitted 13 January, 2020; v1 submitted 21 December, 2018;
originally announced December 2018.
-
Complex energy landscapes in spiked-tensor and simple glassy models: ruggedness, arrangements of local minima and phase transitions
Authors:
Valentina Ros,
Gerard Ben Arous,
Giulio Biroli,
Chiara Cammarota
Abstract:
We study rough high-dimensional landscapes in which an increasingly stronger preference for a given configuration emerges. Such energy landscapes arise in glass physics and inference. In particular we focus on random Gaussian functions, and on the spiked-tensor model and generalizations. We thoroughly analyze the statistical properties of the corresponding landscapes and characterize the associated geometrical phase transitions. In order to perform our study, we develop a framework based on the Kac-Rice method that allows one to compute the complexity of the landscape, i.e. the logarithm of the typical number of stationary points and their Hessian. This approach generalizes the one used to compute rigorously the annealed complexity of mean-field glass models. We discuss its advantages with respect to previous frameworks, in particular the thermodynamical replica method, which is shown to lead to partially incorrect predictions.
Submitted 24 April, 2018; v1 submitted 8 April, 2018;
originally announced April 2018.
-
Comparing Dynamics: Deep Neural Networks versus Glassy Systems
Authors:
M. Baity-Jesi,
L. Sagun,
M. Geiger,
S. Spigler,
G. Ben Arous,
C. Cammarota,
Y. LeCun,
M. Wyart,
G. Biroli
Abstract:
We analyze numerically the training dynamics of deep neural networks (DNN) by using methods developed in statistical physics of glassy systems. The two main issues we address are (1) the complexity of the loss landscape and of the dynamics within it, and (2) to what extent DNNs share similarities with glassy systems. Our findings, obtained for different architectures and datasets, suggest that during the training process the dynamics slows down because of an increasingly large number of flat directions. At large times, when the loss is approaching zero, the system diffuses at the bottom of the landscape. Despite some similarities with the dynamics of mean-field glassy systems, in particular, the absence of barrier crossing, we find distinctive dynamical behaviors in the two cases, showing that the statistical properties of the corresponding loss and energy landscapes are different. In contrast, when the network is under-parametrized we observe a typical glassy behavior, thus suggesting the existence of different phases depending on whether the network is under-parametrized or over-parametrized.
Submitted 7 June, 2018; v1 submitted 19 March, 2018;
originally announced March 2018.
-
Trend extraction in functional data of R and T waves amplitudes of exercise electrocardiogram
Authors:
Camillo Cammarota,
Mario Curione
Abstract:
The R and T waves amplitudes of the electrocardiogram recorded during the exercise test undergo strong modifications in response to stress. We analyze the time series of these amplitudes in a group of normal subjects in the framework of functional data, performing reduction of dimensionality, smoothing and principal component analysis. These methods show that the R and T amplitudes have opposite responses to stress, consisting respectively in a bump and a dip at the early recovery stage. We test these features computing a confidence band for the trend of the population mean and analyzing the zero crossing of its derivative.
Our findings support the existence of a relationship between R and T wave amplitudes and respectively diastolic and systolic ventricular volumes.
Submitted 16 February, 2016;
originally announced February 2016.
-
Estimating the turning point location in shifted exponential model of time series
Authors:
Camillo Cammarota
Abstract:
We consider the distribution of the turning point location of time series modeled as the sum of deterministic trend plus random noise. If the variables are modeled by shifted exponentials, whose location parameters define the trend, we provide a formula for computing the distribution of the turning point location and consequently to estimate a confidence interval for the location. We test this formula in simulated data series having a trend with asymmetric minimum, investigating the coverage rate as a function of a bandwidth parameter. The method is applied to estimate the confidence interval of the minimum location of the time series of RT intervals extracted from the electrocardiogram recorded during the exercise test. We discuss the connection with stochastic ordering.
Submitted 17 February, 2016;
originally announced February 2016.