Search | arXiv e-print repository

arXiv:2006.06997 [pdf, other]

Complex Dynamics in Simple Neural Networks: Understanding Gradient Flow in Phase Retrieval

Authors: Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Florent Krzakala, Pierfrancesco Urbani, Lenka Zdeborová

Abstract: Despite the widespread use of gradient-based algorithms for optimizing high-dimensional non-convex functions, understanding their ability of finding good minima instead of being trapped in spurious ones remains to a large extent an open problem. Here we focus on gradient flow dynamics for phase retrieval from random measurements. When the ratio of the number of measurements over the input dimension is small the dynamics remains trapped in spurious minima with large basins of attraction. We find analytically that above a critical ratio those critical points become unstable developing a negative direction toward the signal. By numerical experiments we show that in this regime the gradient flow algorithm is not trapped; it drifts away from the spurious critical points along the unstable direction and succeeds in finding the global minimum. Using tools from statistical physics we characterize this phenomenon, which is related to a BBP-type transition in the Hessian of the spurious minima.

Submitted 12 June, 2020; originally announced June 2020.

Comments: 9 pages, 5 figures + appendix

Journal ref: Advances in Neural Information Processing Systems, v22, page 3265--327, 2020

arXiv:1907.08226 [pdf, other]

Who is Afraid of Big Bad Minima? Analysis of Gradient-Flow in a Spiked Matrix-Tensor Model

Authors: Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Florent Krzakala, Lenka Zdeborová

Abstract: Gradient-based algorithms are effective for many machine learning tasks, but despite ample recent effort and some progress, it often remains unclear why they work in practice in optimising high-dimensional non-convex functions and why they find good minima instead of being trapped in spurious ones. Here we present a quantitative theory explaining this behaviour in a spiked matrix-tensor model. Our framework is based on the Kac-Rice analysis of stationary points and a closed-form analysis of gradient-flow originating from statistical physics. We show that there is a well defined region of parameters where the gradient-flow algorithm finds a good global minimum despite the presence of exponentially many spurious local minima. We show that this is achieved by surfing on saddles that have strong negative direction towards the global minima, a phenomenon that is connected to a BBP-type threshold in the Hessian describing the critical points of the landscapes.

Submitted 20 January, 2020; v1 submitted 18 July, 2019; originally announced July 2019.

Comments: 9 pages, 4 figures + appendix. Appears in Proceedings of the Advances in Neural Information Processing Systems 2019 (NeurIPS 2019)

Journal ref: Advances in Neural Information Processing Systems, pp. 8676-8686. 2019

arXiv:1905.12294 [pdf, other]

doi 10.1088/1751-8121/ab7b1f

How to iron out rough landscapes and get optimal performances: Averaged Gradient Descent and its application to tensor PCA

Authors: Giulio Biroli, Chiara Cammarota, Federico Ricci-Tersenghi

Abstract: In many high-dimensional estimation problems the main task consists in minimizing a cost function, which is often strongly non-convex when scanned in the space of parameters to be estimated. A standard solution to flatten the corresponding rough landscape consists in summing the losses associated to different data points and obtain a smoother empirical risk. Here we propose a complementary method that works for a single data point. The main idea is that a large amount of the roughness is uncorrelated in different parts of the landscape. One can then substantially reduce the noise by evaluating an empirical average of the gradient obtained as a sum over many random independent positions in the space of parameters to be optimized. We present an algorithm, called Averaged Gradient Descent, based on this idea and we apply it to tensor PCA, which is a very hard estimation problem. We show that Averaged Gradient Descent over-performs physical algorithms such as gradient descent and approximate message passing and matches the best algorithmic thresholds known so far, obtained by tensor unfolding and methods based on sum-of-squares.

Submitted 6 February, 2020; v1 submitted 29 May, 2019; originally announced May 2019.

Comments: 23 pages, 16 figures, including Supplementary Material

Journal ref: J. Phys. A: Math. Theor. 53, 174003 (2020)

arXiv:1812.09066 [pdf, other]

doi 10.1103/PhysRevX.10.011057

Marvels and Pitfalls of the Langevin Algorithm in Noisy High-dimensional Inference

Authors: Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Florent Krzakala, Pierfrancesco Urbani, Lenka Zdeborová

Abstract: Gradient-descent-based algorithms and their stochastic versions have widespread applications in machine learning and statistical inference. In this work we perform an analytic study of the performances of one of them, the Langevin algorithm, in the context of noisy high-dimensional inference. We employ the Langevin algorithm to sample the posterior probability measure for the spiked matrix-tensor model. The typical behaviour of this algorithm is described by a system of integro-differential equations that we call the Langevin state evolution, whose solution is compared with the one of the state evolution of approximate message passing (AMP). Our results show that, remarkably, the algorithmic threshold of the Langevin algorithm is sub-optimal with respect to the one given by AMP. We conjecture this phenomenon to be due to the residual glassiness present in that region of parameters. Finally we show how a landscape-annealing protocol, that uses the Langevin algorithm but violates the Bayes-optimality condition, can approach the performance of AMP.

Submitted 13 January, 2020; v1 submitted 21 December, 2018; originally announced December 2018.

Comments: 11 pages and 5 figures + appendix

Journal ref: Phys. Rev. X 10, 011057 (2020)

arXiv:1804.02686 [pdf, other]

doi 10.1103/PhysRevX.9.011003

Complex energy landscapes in spiked-tensor and simple glassy models: ruggedness, arrangements of local minima and phase transitions

Authors: Valentina Ros, Gerard Ben Arous, Giulio Biroli, Chiara Cammarota

Abstract: We study rough high-dimensional landscapes in which an increasingly stronger preference for a given configuration emerges. Such energy landscapes arise in glass physics and inference. In particular we focus on random Gaussian functions, and on the spiked-tensor model and generalizations. We thoroughly analyze the statistical properties of the corresponding landscapes and characterize the associated geometrical phase transitions. In order to perform our study, we develop a framework based on the Kac-Rice method that allows to compute the complexity of the landscape, i.e. the logarithm of the typical number of stationary points and their Hessian. This approach generalizes the one used to compute rigorously the annealed complexity of mean-field glass models. We discuss its advantages with respect to previous frameworks, in particular the thermodynamical replica method which is shown to lead to partially incorrect predictions.

Submitted 24 April, 2018; v1 submitted 8 April, 2018; originally announced April 2018.

Comments: v2 with references added, typos corrected

Journal ref: Phys. Rev. X 9, 011003 (2019)

arXiv:1803.06969 [pdf, other]

doi 10.1088/1742-5468/ab3281

Comparing Dynamics: Deep Neural Networks versus Glassy Systems

Authors: M. Baity-Jesi, L. Sagun, M. Geiger, S. Spigler, G. Ben Arous, C. Cammarota, Y. LeCun, M. Wyart, G. Biroli

Abstract: We analyze numerically the training dynamics of deep neural networks (DNN) by using methods developed in statistical physics of glassy systems. The two main issues we address are (1) the complexity of the loss landscape and of the dynamics within it, and (2) to what extent DNNs share similarities with glassy systems. Our findings, obtained for different architectures and datasets, suggest that during the training process the dynamics slows down because of an increasingly large number of flat directions. At large times, when the loss is approaching zero, the system diffuses at the bottom of the landscape. Despite some similarities with the dynamics of mean-field glassy systems, in particular, the absence of barrier crossing, we find distinctive dynamical behaviors in the two cases, showing that the statistical properties of the corresponding loss and energy landscapes are different. In contrast, when the network is under-parametrized we observe a typical glassy behavior, thus suggesting the existence of different phases depending on whether the network is under-parametrized or over-parametrized.

Submitted 7 June, 2018; v1 submitted 19 March, 2018; originally announced March 2018.

Comments: 10 pages, 5 figures. Version accepted at ICML 2018

Journal ref: PMLR 80:324-333, 2018; Republication with DOI (cite this one): J. Stat. Mech. (2019) 124013

arXiv:1602.05907 [pdf, other]

Trend extraction in functional data of R and T waves amplitudes of exercise electrocardiogram

Authors: Camillo Cammarota, Mario Curione

Abstract: The R and T waves amplitudes of the electrocardiogram recorded during the exercise test undergo strong modifications in response to stress. We analyze the time series of these amplitudes in a group of normal subjects in the framework of functional data, performing reduction of dimensionality, smoothing and principal component analysis. These methods show that the R and T amplitudes have opposite responses to stress, consisting respectively in a bump and a dip at the early recovery stage. We test these features computing a confidence band for the trend of the population mean and analyzing the zero crossing of its derivative. Our findings support the existence of a relationship between R and T wave amplitudes and respectively diastolic and systolic ventricular volumes.

Submitted 16 February, 2016; originally announced February 2016.

Report number: Roma01.Math.MP

arXiv:1602.05906 [pdf, other]

Estimating the turning point location in shifted exponential model of time series

Authors: Camillo Cammarota

Abstract: We consider the distribution of the turning point location of time series modeled as the sum of deterministic trend plus random noise. If the variables are modeled by shifted exponentials, whose location parameters define the trend, we provide a formula for computing the distribution of the turning point location and consequently to estimate a confidence interval for the location. We test this formula in simulated data series having a trend with asymmetric minimum, investigating the coverage rate as a function of a bandwidth parameter. The method is applied to estimate the confidence interval of the minimum location of the time series of RT intervals extracted from the electrocardiogram recorded during the exercise test. We discuss the connection with stochastic ordering.

Submitted 17 February, 2016; originally announced February 2016.

Report number: Roma01.Math.MP

Showing 1–8 of 8 results for author: Cammarota, C