Search | arXiv e-print repository

From Zero to Hero: How local curvature at artless initial conditions leads away from bad minima

Authors: Tony Bonnaire, Giulio Biroli, Chiara Cammarota

Abstract: We investigate the optimization dynamics of gradient descent in a non-convex and high-dimensional setting, with a focus on the phase retrieval problem as a case study for complex loss landscapes. We first study the high-dimensional limit where both the number $M$ and the dimension $N$ of the data are going to infinity at fixed signal-to-noise ratio $α= M/N$. By analyzing how the local curvature ch… ▽ More We investigate the optimization dynamics of gradient descent in a non-convex and high-dimensional setting, with a focus on the phase retrieval problem as a case study for complex loss landscapes. We first study the high-dimensional limit where both the number $M$ and the dimension $N$ of the data are going to infinity at fixed signal-to-noise ratio $α= M/N$. By analyzing how the local curvature changes during optimization, we uncover that for intermediate $α$, the Hessian displays a downward direction pointing towards good minima in the first regime of the descent, before being trapped in bad minima at the end. Hence, the local landscape is benign and informative at first, before gradient descent brings the system into a uninformative maze. The transition between the two regimes is associated to a BBP-type threshold in the time-dependent Hessian. Through both theoretical analysis and numerical experiments, we show that in practical cases, i.e. for finite but even very large $N$, successful optimization via gradient descent in phase retrieval is achieved by falling towards the good minima before reaching the bad ones. This mechanism explains why successful recovery is obtained well before the algorithmic transition corresponding to the high-dimensional limit. Technically, this is associated to strong logarithmic corrections of the algorithmic transition at large $N$ with respect to the one expected in the $N\to\infty$ limit. Our analysis sheds light on such a new mechanism that facilitate gradient descent dynamics in finite large dimensions, also highlighting the importance of good initialization of spectral properties for optimization in complex high-dimensional landscapes. △ Less

Submitted 4 March, 2024; originally announced March 2024.

Comments: 21 pages, 10 figures

arXiv:2006.06997 [pdf, other]

Complex Dynamics in Simple Neural Networks: Understanding Gradient Flow in Phase Retrieval

Authors: Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Florent Krzakala, Pierfrancesco Urbani, Lenka Zdeborová

Abstract: Despite the widespread use of gradient-based algorithms for optimizing high-dimensional non-convex functions, understanding their ability of finding good minima instead of being trapped in spurious ones remains to a large extent an open problem. Here we focus on gradient flow dynamics for phase retrieval from random measurements. When the ratio of the number of measurements over the input dimensio… ▽ More Despite the widespread use of gradient-based algorithms for optimizing high-dimensional non-convex functions, understanding their ability of finding good minima instead of being trapped in spurious ones remains to a large extent an open problem. Here we focus on gradient flow dynamics for phase retrieval from random measurements. When the ratio of the number of measurements over the input dimension is small the dynamics remains trapped in spurious minima with large basins of attraction. We find analytically that above a critical ratio those critical points become unstable develo** a negative direction toward the signal. By numerical experiments we show that in this regime the gradient flow algorithm is not trapped; it drifts away from the spurious critical points along the unstable direction and succeeds in finding the global minimum. Using tools from statistical physics we characterize this phenomenon, which is related to a BBP-type transition in the Hessian of the spurious minima. △ Less

Submitted 12 June, 2020; originally announced June 2020.

Comments: 9 pages, 5 figures + appendix

Journal ref: Advances in Neural Information Processing Systems, v22, page 3265--327, 2020

arXiv:1907.08226 [pdf, other]

Who is Afraid of Big Bad Minima? Analysis of Gradient-Flow in a Spiked Matrix-Tensor Model

Authors: Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Florent Krzakala, Lenka Zdeborová

Abstract: Gradient-based algorithms are effective for many machine learning tasks, but despite ample recent effort and some progress, it often remains unclear why they work in practice in optimising high-dimensional non-convex functions and why they find good minima instead of being trapped in spurious ones. Here we present a quantitative theory explaining this behaviour in a spiked matrix-tensor model.… ▽ More Gradient-based algorithms are effective for many machine learning tasks, but despite ample recent effort and some progress, it often remains unclear why they work in practice in optimising high-dimensional non-convex functions and why they find good minima instead of being trapped in spurious ones. Here we present a quantitative theory explaining this behaviour in a spiked matrix-tensor model. Our framework is based on the Kac-Rice analysis of stationary points and a closed-form analysis of gradient-flow originating from statistical physics. We show that there is a well defined region of parameters where the gradient-flow algorithm finds a good global minimum despite the presence of exponentially many spurious local minima. We show that this is achieved by surfing on saddles that have strong negative direction towards the global minima, a phenomenon that is connected to a BBP-type threshold in the Hessian describing the critical points of the landscapes. △ Less

Submitted 20 January, 2020; v1 submitted 18 July, 2019; originally announced July 2019.

Comments: 9 pages, 4 figures + appendix. Appears in Proceedings of the Advances in Neural Information Processing Systems 2019 (NeurIPS 2019)

Journal ref: Advances in Neural Information Processing Systems, pp. 8676-8686. 2019

arXiv:1906.04148 [pdf, other]

Who has the last word? Understanding How to Sample Online Discussions

Authors: Gioia Boschi, Anthony P. Young, Sagar Joglekar, Chiara Cammarota, Nishanth Sastry

Abstract: In online debates individual arguments support or attack each other, leading to some subset of arguments being considered more relevant than others. However, in large discussions readers are often forced to sample a subset of the arguments being put forth. Since such sampling is rarely done in a principled manner, users may not read all the relevant arguments to get a full picture of the debate. T… ▽ More In online debates individual arguments support or attack each other, leading to some subset of arguments being considered more relevant than others. However, in large discussions readers are often forced to sample a subset of the arguments being put forth. Since such sampling is rarely done in a principled manner, users may not read all the relevant arguments to get a full picture of the debate. This paper is interested in answering the question of how users should sample online conversations to selectively favour the currently justified or accepted positions in the debate. We apply techniques from argumentation theory and complex networks to build a model that predicts the probabilities of the normatively justified arguments given their location in online discussions. Our model shows that the proportion of replies that are supportive, the number of replies that comments receive, and the locations of un-replied comments all determine the probability that a comment is a justified argument. We show that when the degree distribution of the number of replies is homogeneous along the discussion, for acrimonious discussions, the distribution of justified arguments depends on the parity of the graph level. In supportive discussions the probability of having justified comments increases as one moves away from the root. For discussion trees that have a non-homogeneous in-degree distribution, for supportive discussions we observe the same behaviour as before, while for acrimonious discussions we cannot observe the same parity-based distribution. This is verified with data obtained from the online debating platform Kialo. By predicting the locations of the justified arguments in reply trees, we can suggest which arguments readers should sample to grasp the currently accepted opinions in such discussions. Our models have important implications for the design of future online debating platforms. △ Less

Submitted 10 April, 2021; v1 submitted 10 June, 2019; originally announced June 2019.

ACM Class: H.5.0; H.5.4; E.1; F.3; G.3

arXiv:1905.12294 [pdf, other]

doi 10.1088/1751-8121/ab7b1f

How to iron out rough landscapes and get optimal performances: Averaged Gradient Descent and its application to tensor PCA

Authors: Giulio Biroli, Chiara Cammarota, Federico Ricci-Tersenghi

Abstract: In many high-dimensional estimation problems the main task consists in minimizing a cost function, which is often strongly non-convex when scanned in the space of parameters to be estimated. A standard solution to flatten the corresponding rough landscape consists in summing the losses associated to different data points and obtain a smoother empirical risk. Here we propose a complementary method… ▽ More In many high-dimensional estimation problems the main task consists in minimizing a cost function, which is often strongly non-convex when scanned in the space of parameters to be estimated. A standard solution to flatten the corresponding rough landscape consists in summing the losses associated to different data points and obtain a smoother empirical risk. Here we propose a complementary method that works for a single data point. The main idea is that a large amount of the roughness is uncorrelated in different parts of the landscape. One can then substantially reduce the noise by evaluating an empirical average of the gradient obtained as a sum over many random independent positions in the space of parameters to be optimized. We present an algorithm, called Averaged Gradient Descent, based on this idea and we apply it to tensor PCA, which is a very hard estimation problem. We show that Averaged Gradient Descent over-performs physical algorithms such as gradient descent and approximate message passing and matches the best algorithmic thresholds known so far, obtained by tensor unfolding and methods based on sum-of-squares. △ Less

Submitted 6 February, 2020; v1 submitted 29 May, 2019; originally announced May 2019.

Comments: 23 pages, 16 figures, including Supplementary Material

Journal ref: J. Phys. A: Math. Theor. 53, 174003 (2020)

arXiv:1812.09066 [pdf, other]

doi 10.1103/PhysRevX.10.011057

Marvels and Pitfalls of the Langevin Algorithm in Noisy High-dimensional Inference

Authors: Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Florent Krzakala, Pierfrancesco Urbani, Lenka Zdeborová

Abstract: Gradient-descent-based algorithms and their stochastic versions have widespread applications in machine learning and statistical inference. In this work we perform an analytic study of the performances of one of them, the Langevin algorithm, in the context of noisy high-dimensional inference. We employ the Langevin algorithm to sample the posterior probability measure for the spiked matrix-tensor… ▽ More Gradient-descent-based algorithms and their stochastic versions have widespread applications in machine learning and statistical inference. In this work we perform an analytic study of the performances of one of them, the Langevin algorithm, in the context of noisy high-dimensional inference. We employ the Langevin algorithm to sample the posterior probability measure for the spiked matrix-tensor model. The typical behaviour of this algorithm is described by a system of integro-differential equations that we call the Langevin state evolution, whose solution is compared with the one of the state evolution of approximate message passing (AMP). Our results show that, remarkably, the algorithmic threshold of the Langevin algorithm is sub-optimal with respect to the one given by AMP. We conjecture this phenomenon to be due to the residual glassiness present in that region of parameters. Finally we show how a landscape-annealing protocol, that uses the Langevin algorithm but violate the Bayes-optimality condition, can approach the performance of AMP. △ Less

Submitted 13 January, 2020; v1 submitted 21 December, 2018; originally announced December 2018.

Comments: 11 pages and 5 figures + appendix

Journal ref: Phys. Rev. X 10, 011057 (2020)

arXiv:1803.06969 [pdf, other]

doi 10.1088/1742-5468/ab3281

Comparing Dynamics: Deep Neural Networks versus Glassy Systems

Authors: M. Baity-Jesi, L. Sagun, M. Geiger, S. Spigler, G. Ben Arous, C. Cammarota, Y. LeCun, M. Wyart, G. Biroli

Abstract: We analyze numerically the training dynamics of deep neural networks (DNN) by using methods developed in statistical physics of glassy systems. The two main issues we address are (1) the complexity of the loss landscape and of the dynamics within it, and (2) to what extent DNNs share similarities with glassy systems. Our findings, obtained for different architectures and datasets, suggest that dur… ▽ More We analyze numerically the training dynamics of deep neural networks (DNN) by using methods developed in statistical physics of glassy systems. The two main issues we address are (1) the complexity of the loss landscape and of the dynamics within it, and (2) to what extent DNNs share similarities with glassy systems. Our findings, obtained for different architectures and datasets, suggest that during the training process the dynamics slows down because of an increasingly large number of flat directions. At large times, when the loss is approaching zero, the system diffuses at the bottom of the landscape. Despite some similarities with the dynamics of mean-field glassy systems, in particular, the absence of barrier crossing, we find distinctive dynamical behaviors in the two cases, showing that the statistical properties of the corresponding loss and energy landscapes are different. In contrast, when the network is under-parametrized we observe a typical glassy behavior, thus suggesting the existence of different phases depending on whether the network is under-parametrized or over-parametrized. △ Less

Submitted 7 June, 2018; v1 submitted 19 March, 2018; originally announced March 2018.

Comments: 10 pages, 5 figures. Version accepted at ICML 2018

Journal ref: PMLR 80:324-333, 2018; Republication with DOI (cite this one): J. Stat. Mech. (2019) 124013

Showing 1–7 of 7 results for author: Cammarota, C