Search | arXiv e-print repository

Optimal thresholds and algorithms for a model of multi-modal learning in high dimensions

Authors: Christian Keup, Lenka Zdeborová

Abstract: This work explores multi-modal inference in a high-dimensional simplified model, analytically quantifying the performance gain of multi-modal inference over that of analyzing modalities in isolation. We present the Bayes-optimal performance and weak recovery thresholds in a model where the objective is to recover the latent structures from two noisy data matrices with correlated spikes. The paper… ▽ More This work explores multi-modal inference in a high-dimensional simplified model, analytically quantifying the performance gain of multi-modal inference over that of analyzing modalities in isolation. We present the Bayes-optimal performance and weak recovery thresholds in a model where the objective is to recover the latent structures from two noisy data matrices with correlated spikes. The paper derives the approximate message passing (AMP) algorithm for this model and characterizes its performance in the high-dimensional limit via the associated state evolution. The analysis holds for a broad range of priors and noise channels, which can differ across modalities. The linearization of AMP is compared numerically to the widely used partial least squares (PLS) and canonical correlation analysis (CCA) methods, which are both observed to suffer from a sub-optimal recovery threshold. △ Less

Submitted 3 July, 2024; originally announced July 2024.

arXiv:2405.15480 [pdf, other]

Fundamental limits of weak learnability in high-dimensional multi-index models

Authors: Emanuele Troiani, Yatin Dandi, Leonardo Defilippis, Lenka Zdeborová, Bruno Loureiro, Florent Krzakala

Abstract: Multi-index models -- functions which only depend on the covariates through a non-linear transformation of their projection on a subspace -- are a useful benchmark for investigating feature learning with neural networks. This paper examines the theoretical boundaries of learnability in this hypothesis class, focusing particularly on the minimum sample complexity required for weakly recovering thei… ▽ More Multi-index models -- functions which only depend on the covariates through a non-linear transformation of their projection on a subspace -- are a useful benchmark for investigating feature learning with neural networks. This paper examines the theoretical boundaries of learnability in this hypothesis class, focusing particularly on the minimum sample complexity required for weakly recovering their low-dimensional structure with first-order iterative algorithms, in the high-dimensional regime where the number of samples is $n=αd$ is proportional to the covariate dimension $d$. Our findings unfold in three parts: (i) first, we identify under which conditions a \textit{trivial subspace} can be learned with a single step of a first-order algorithm for any $α\!>\!0$; (ii) second, in the case where the trivial subspace is empty, we provide necessary and sufficient conditions for the existence of an {\it easy subspace} consisting of directions that can be learned only above a certain sample complexity $α\!>\!α_c$. The critical threshold $α_{c}$ marks the presence of a computational phase transition, in the sense that no efficient iterative algorithm can succeed for $α\!<\!α_c$. In a limited but interesting set of really hard directions -- akin to the parity problem -- $α_c$ is found to diverge. Finally, (iii) we demonstrate that interactions between different directions can result in an intricate hierarchical learning phenomenon, where some directions can be learned sequentially when coupled to easier ones. Our analytical approach is built on the optimality of approximate message-passing algorithms among first-order iterative methods, delineating the fundamental learnability limit across a broad spectrum of algorithms, including neural networks trained with gradient descent. △ Less

Submitted 24 May, 2024; originally announced May 2024.

arXiv:2405.10763 [pdf, other]

Integer Traffic Assignment Problem: Algorithms and Insights on Random Graphs

Authors: Rayan Harfouche, Giovanni Piccioli, Lenka Zdeborová

Abstract: Path optimization is a fundamental concern across various real-world scenarios, ranging from traffic congestion issues to efficient data routing over the internet. The Traffic Assignment Problem (TAP) is a classic continuous optimization problem in this field. This study considers the Integer Traffic Assignment Problem (ITAP), a discrete variant of TAP. ITAP involves determining optimal routes for… ▽ More Path optimization is a fundamental concern across various real-world scenarios, ranging from traffic congestion issues to efficient data routing over the internet. The Traffic Assignment Problem (TAP) is a classic continuous optimization problem in this field. This study considers the Integer Traffic Assignment Problem (ITAP), a discrete variant of TAP. ITAP involves determining optimal routes for commuters in a city represented by a graph, aiming to minimize congestion while adhering to integer flow constraints on paths. This restriction makes ITAP an NP-hard problem. While conventional TAP prioritizes repulsive interactions to minimize congestion, this work also explores the case of attractive interactions, related to minimizing the number of occupied edges. We present and evaluate multiple algorithms to address ITAP, including a message passing algorithm, a greedy approach, simulated annealing, and relaxation of ITAP to TAP. Inspired by studies of random ensembles in the large-size limit in statistical physics, comparisons between these algorithms are conducted on large sparse random regular graphs with a random set of origin-destination pairs. Our results indicate that while the simplest greedy algorithm performs competitively in the repulsive scenario, in the attractive case the message-passing-based algorithm and simulated annealing demonstrate superiority. We then investigate the relationship between TAP and ITAP in the repulsive case. We find that, as the number of paths increases, the solution of TAP converges toward that of ITAP, and we investigate the speed of this convergence. Depending on the number of paths, our analysis leads us to identify two scaling regimes: in one the average flow per edge is of order one, and in another the number of paths scales quadratically with the size of the graph, in which case the continuous relaxation solves the integer problem closely. △ Less

Submitted 17 May, 2024; originally announced May 2024.

Comments: 37 pages, 15 figures

arXiv:2403.04234 [pdf, other]

Fundamental limits of Non-Linear Low-Rank Matrix Estimation

Authors: Pierre Mergny, Justin Ko, Florent Krzakala, Lenka Zdeborová

Abstract: We consider the task of estimating a low-rank matrix from non-linear and noisy observations. We prove a strong universality result showing that Bayes-optimal performances are characterized by an equivalent Gaussian model with an effective prior, whose parameters are entirely determined by an expansion of the non-linear function. In particular, we show that to reconstruct the signal accurately, one… ▽ More We consider the task of estimating a low-rank matrix from non-linear and noisy observations. We prove a strong universality result showing that Bayes-optimal performances are characterized by an equivalent Gaussian model with an effective prior, whose parameters are entirely determined by an expansion of the non-linear function. In particular, we show that to reconstruct the signal accurately, one requires a signal-to-noise ratio growing as $N^{\frac 12 (1-1/k_F)}$, where $k_F$ is the first non-zero Fisher information coefficient of the function. We provide asymptotic characterization for the minimal achievable mean squared error (MMSE) and an approximate message-passing algorithm that reaches the MMSE under conditions analogous to the linear version of the problem. We also provide asymptotic errors achieved by methods such as principal component analysis combined with Bayesian denoising, and compare them with Bayes-optimal MMSE. △ Less

Submitted 7 March, 2024; originally announced March 2024.

Comments: 42 pages, 2 figures

arXiv:2402.13622 [pdf, ps, other]

Analysis of Bootstrap and Subsampling in High-dimensional Regularized Regression

Authors: Lucas Clarté, Adrien Vandenbroucque, Guillaume Dalle, Bruno Loureiro, Florent Krzakala, Lenka Zdeborová

Abstract: We investigate popular resampling methods for estimating the uncertainty of statistical models, such as subsampling, bootstrap and the jackknife, and their performance in high-dimensional supervised regression tasks. We provide a tight asymptotic description of the biases and variances estimated by these methods in the context of generalized linear models, such as ridge and logistic regression, ta… ▽ More We investigate popular resampling methods for estimating the uncertainty of statistical models, such as subsampling, bootstrap and the jackknife, and their performance in high-dimensional supervised regression tasks. We provide a tight asymptotic description of the biases and variances estimated by these methods in the context of generalized linear models, such as ridge and logistic regression, taking the limit where the number of samples $n$ and dimension $d$ of the covariates grow at a comparable fixed rate $α\!=\! n/d$. Our findings are three-fold: i) resampling methods are fraught with problems in high dimensions and exhibit the double-descent-like behavior typical of these situations; ii) only when $α$ is large enough do they provide consistent and reliable error estimations (we give convergence rates); iii) in the over-parametrized regime $α\!<\!1$ relevant to modern machine learning practice, their predictions are not consistent, even with optimal regularization. △ Less

Submitted 21 February, 2024; originally announced February 2024.

arXiv:2402.04980 [pdf, other]

Asymptotics of feature learning in two-layer networks after one gradient-step

Authors: Hugo Cui, Luca Pesce, Yatin Dandi, Florent Krzakala, Yue M. Lu, Lenka Zdeborová, Bruno Loureiro

Abstract: In this manuscript, we investigate the problem of how two-layer neural networks learn features from data, and improve over the kernel regime, after being trained with a single gradient descent step. Leveraging the insight from (Ba et al., 2022), we model the trained network by a spiked Random Features (sRF) model. Further building on recent progress on Gaussian universality (Dandi et al., 2023), w… ▽ More In this manuscript, we investigate the problem of how two-layer neural networks learn features from data, and improve over the kernel regime, after being trained with a single gradient descent step. Leveraging the insight from (Ba et al., 2022), we model the trained network by a spiked Random Features (sRF) model. Further building on recent progress on Gaussian universality (Dandi et al., 2023), we provide an exact asymptotic description of the generalization error of the sRF in the high-dimensional limit where the number of samples, the width, and the input dimension grow at a proportional rate. The resulting characterization for sRFs also captures closely the learning curves of the original network model. This enables us to understand how adapting to the data is crucial for the network to efficiently learn non-linear functions in the direction of the gradient -- where at initialization it can only express linear functions in this regime. △ Less

Submitted 4 June, 2024; v1 submitted 7 February, 2024; originally announced February 2024.

arXiv:2402.03902 [pdf, other]

A phase transition between positional and semantic learning in a solvable model of dot-product attention

Authors: Hugo Cui, Freya Behrens, Florent Krzakala, Lenka Zdeborová

Abstract: We investigate how a dot-product attention layer learns a positional attention matrix (with tokens attending to each other based on their respective positions) and a semantic attention matrix (with tokens attending to each other based on their meaning). For an algorithmic task, we experimentally show how the same simple architecture can learn to implement a solution using either the positional or… ▽ More We investigate how a dot-product attention layer learns a positional attention matrix (with tokens attending to each other based on their respective positions) and a semantic attention matrix (with tokens attending to each other based on their meaning). For an algorithmic task, we experimentally show how the same simple architecture can learn to implement a solution using either the positional or semantic mechanism. On the theoretical side, we study the learning of a non-linear self-attention layer with trainable tied and low-rank query and key matrices. In the asymptotic limit of high-dimensional data and a comparably large number of training samples, we provide a closed-form characterization of the global minimum of the non-convex empirical loss landscape. We show that this minimum corresponds to either a positional or a semantic mechanism and evidence an emergent phase transition from the former to the latter with increasing sample complexity. Finally, we compare the dot-product attention layer to linear positional baseline, and show that it outperforms the latter using the semantic mechanism provided it has access to sufficient data. △ Less

Submitted 6 February, 2024; originally announced February 2024.

arXiv:2402.03818 [pdf, other]

Asymptotic generalization error of a single-layer graph convolutional network

Authors: O. Duranthon, L. Zdeborová

Abstract: While graph convolutional networks show great practical promises, the theoretical understanding of their generalization properties as a function of the number of samples is still in its infancy compared to the more broadly studied case of supervised fully connected neural networks. In this article, we predict the performances of a single-layer graph convolutional network (GCN) trained on data prod… ▽ More While graph convolutional networks show great practical promises, the theoretical understanding of their generalization properties as a function of the number of samples is still in its infancy compared to the more broadly studied case of supervised fully connected neural networks. In this article, we predict the performances of a single-layer graph convolutional network (GCN) trained on data produced by attributed stochastic block models (SBMs) in the high-dimensional limit. Previously, only ridge regression on contextual-SBM (CSBM) has been considered in Shi et al. 2022; we generalize the analysis to arbitrary convex loss and regularization for the CSBM and add the analysis for another data model, the neural-prior SBM. We also study the high signal-to-noise ratio limit, detail the convergence rates of the GCN and show that, while consistent, it does not reach the Bayes-optimal rate for any of the considered cases. △ Less

Submitted 20 March, 2024; v1 submitted 6 February, 2024; originally announced February 2024.

arXiv:2402.03220 [pdf, other]

The Benefits of Reusing Batches for Gradient Descent in Two-Layer Networks: Breaking the Curse of Information and Leap Exponents

Authors: Yatin Dandi, Emanuele Troiani, Luca Arnaboldi, Luca Pesce, Lenka Zdeborová, Florent Krzakala

Abstract: We investigate the training dynamics of two-layer neural networks when learning multi-index target functions. We focus on multi-pass gradient descent (GD) that reuses the batches multiple times and show that it significantly changes the conclusion about which functions are learnable compared to single-pass gradient descent. In particular, multi-pass GD with finite stepsize is found to overcome the… ▽ More We investigate the training dynamics of two-layer neural networks when learning multi-index target functions. We focus on multi-pass gradient descent (GD) that reuses the batches multiple times and show that it significantly changes the conclusion about which functions are learnable compared to single-pass gradient descent. In particular, multi-pass GD with finite stepsize is found to overcome the limitations of gradient flow and single-pass GD given by the information exponent (Ben Arous et al., 2021) and leap exponent (Abbe et al., 2023) of the target function. We show that upon re-using batches, the network achieves in just two time steps an overlap with the target subspace even for functions not satisfying the staircase property (Abbe et al., 2021). We characterize the (broad) class of functions efficiently learned in finite time. The proof of our results is based on the analysis of the Dynamical Mean-Field Theory (DMFT). We further provide a closed-form description of the dynamical process of the low-dimensional projections of the weights, and numerical experiments illustrating the theory. △ Less

Submitted 30 June, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

Comments: Accepted at the International Conference on Machine Learning (ICML), 2024

arXiv:2310.03575 [pdf, other]

Analysis of learning a flow-based generative model from limited sample complexity

Authors: Hugo Cui, Florent Krzakala, Eric Vanden-Eijnden, Lenka Zdeborová

Abstract: We study the problem of training a flow-based generative model, parametrized by a two-layer autoencoder, to sample from a high-dimensional Gaussian mixture. We provide a sharp end-to-end analysis of the problem. First, we provide a tight closed-form characterization of the learnt velocity field, when parametrized by a shallow denoising auto-encoder trained on a finite number $n$ of samples from th… ▽ More We study the problem of training a flow-based generative model, parametrized by a two-layer autoencoder, to sample from a high-dimensional Gaussian mixture. We provide a sharp end-to-end analysis of the problem. First, we provide a tight closed-form characterization of the learnt velocity field, when parametrized by a shallow denoising auto-encoder trained on a finite number $n$ of samples from the target distribution. Building on this analysis, we provide a sharp description of the corresponding generative flow, which pushes the base Gaussian density forward to an approximation of the target density. In particular, we provide closed-form formulae for the distance between the mean of the generated mixture and the mean of the target mixture, which we show decays as $Θ_n(\frac{1}{n})$. Finally, this rate is shown to be in fact Bayes-optimal. △ Less

Submitted 25 June, 2024; v1 submitted 5 October, 2023; originally announced October 2023.

arXiv:2308.14085 [pdf, other]

Sampling with flows, diffusion and autoregressive neural networks: A spin-glass perspective

Authors: Davide Ghio, Yatin Dandi, Florent Krzakala, Lenka Zdeborová

Abstract: Recent years witnessed the development of powerful generative models based on flows, diffusion or autoregressive neural networks, achieving remarkable success in generating data from examples with applications in a broad range of areas. A theoretical analysis of the performance and understanding of the limitations of these methods remain, however, challenging. In this paper, we undertake a step in… ▽ More Recent years witnessed the development of powerful generative models based on flows, diffusion or autoregressive neural networks, achieving remarkable success in generating data from examples with applications in a broad range of areas. A theoretical analysis of the performance and understanding of the limitations of these methods remain, however, challenging. In this paper, we undertake a step in this direction by analysing the efficiency of sampling by these methods on a class of problems with a known probability distribution and comparing it with the sampling performance of more traditional methods such as the Monte Carlo Markov chain and Langevin dynamics. We focus on a class of probability distribution widely studied in the statistical physics of disordered systems that relate to spin glasses, statistical inference and constraint satisfaction problems. We leverage the fact that sampling via flow-based, diffusion-based or autoregressive networks methods can be equivalently mapped to the analysis of a Bayes optimal denoising of a modified probability measure. Our findings demonstrate that these methods encounter difficulties in sampling stemming from the presence of a first-order phase transition along the algorithm's denoising path. Our conclusions go both ways: we identify regions of parameters where these methods are unable to sample efficiently, while that is possible using standard Monte Carlo or Langevin approaches. We also identify regions where the opposite happens: standard approaches are inefficient while the discussed generative methods work well. △ Less

Submitted 27 August, 2023; originally announced August 2023.

Comments: 39 pages, 12 figures

arXiv:2306.07948 [pdf, other]

Optimal Inference in Contextual Stochastic Block Models

Authors: O. Duranthon, L. Zdeborová

Abstract: The contextual stochastic block model (cSBM) was proposed for unsupervised community detection on attributed graphs where both the graph and the high-dimensional node information correlate with node labels. In the context of machine learning on graphs, the cSBM has been widely used as a synthetic dataset for evaluating the performance of graph-neural networks (GNNs) for semi-supervised node classi… ▽ More The contextual stochastic block model (cSBM) was proposed for unsupervised community detection on attributed graphs where both the graph and the high-dimensional node information correlate with node labels. In the context of machine learning on graphs, the cSBM has been widely used as a synthetic dataset for evaluating the performance of graph-neural networks (GNNs) for semi-supervised node classification. We consider a probabilistic Bayes-optimal formulation of the inference problem and we derive a belief-propagation-based algorithm for the semi-supervised cSBM; we conjecture it is optimal in the considered setting and we provide its implementation. We show that there can be a considerable gap between the accuracy reached by this algorithm and the performance of the GNN architectures proposed in the literature. This suggests that the cSBM, along with the comparison to the performance of the optimal algorithm, readily accessible via our implementation, can be instrumental in the development of more performant GNN architectures. △ Less

Submitted 5 March, 2024; v1 submitted 6 June, 2023; originally announced June 2023.

Journal ref: TMLR 2024

arXiv:2306.02729 [pdf, other]

Gibbs Sampling the Posterior of Neural Networks

Authors: Giovanni Piccioli, Emanuele Troiani, Lenka Zdeborová

Abstract: In this paper, we study sampling from a posterior derived from a neural network. We propose a new probabilistic model consisting of adding noise at every pre- and post-activation in the network, arguing that the resulting posterior can be sampled using an efficient Gibbs sampler. For small models, the Gibbs sampler attains similar performances as the state-of-the-art Markov chain Monte Carlo (MCMC… ▽ More In this paper, we study sampling from a posterior derived from a neural network. We propose a new probabilistic model consisting of adding noise at every pre- and post-activation in the network, arguing that the resulting posterior can be sampled using an efficient Gibbs sampler. For small models, the Gibbs sampler attains similar performances as the state-of-the-art Markov chain Monte Carlo (MCMC) methods, such as the Hamiltonian Monte Carlo (HMC) or the Metropolis adjusted Langevin algorithm (MALA), both on real and synthetic data. By framing our analysis in the teacher-student setting, we introduce a thermalization criterion that allows us to detect when an algorithm, when run on data with synthetic labels, fails to sample from the posterior. The criterion is based on the fact that in the teacher-student setting we can initialize an algorithm directly at equilibrium. △ Less

Submitted 11 January, 2024; v1 submitted 5 June, 2023; originally announced June 2023.

arXiv:2305.11041 [pdf, other]

High-dimensional Asymptotics of Denoising Autoencoders

Authors: Hugo Cui, Lenka Zdeborová

Abstract: We address the problem of denoising data from a Gaussian mixture using a two-layer non-linear autoencoder with tied weights and a skip connection. We consider the high-dimensional limit where the number of training samples and the input dimension jointly tend to infinity while the number of hidden units remains bounded. We provide closed-form expressions for the denoising mean-squared test error.… ▽ More We address the problem of denoising data from a Gaussian mixture using a two-layer non-linear autoencoder with tied weights and a skip connection. We consider the high-dimensional limit where the number of training samples and the input dimension jointly tend to infinity while the number of hidden units remains bounded. We provide closed-form expressions for the denoising mean-squared test error. Building on this result, we quantitatively characterize the advantage of the considered architecture over the autoencoder without the skip connection that relates closely to principal component analysis. We further show that our results accurately capture the learning curves on a range of real data sets. △ Less

Submitted 18 May, 2023; originally announced May 2023.

arXiv:2304.12127 [pdf, other]

Compressed sensing with l0-norm: statistical physics analysis and algorithms for signal recovery

Authors: D. Barbier, C Lucibello, L. Saglietti, F. Krzakala, L. Zdeborova

Abstract: Noiseless compressive sensing is a protocol that enables undersampling and later recovery of a signal without loss of information. This compression is possible because the signal is usually sufficiently sparse in a given basis. Currently, the algorithm offering the best tradeoff between compression rate, robustness, and speed for compressive sensing is the LASSO (l1-norm bias) algorithm. However,… ▽ More Noiseless compressive sensing is a protocol that enables undersampling and later recovery of a signal without loss of information. This compression is possible because the signal is usually sufficiently sparse in a given basis. Currently, the algorithm offering the best tradeoff between compression rate, robustness, and speed for compressive sensing is the LASSO (l1-norm bias) algorithm. However, many studies have pointed out the possibility that the implementation of lp-norms biases, with p smaller than one, could give better performance while sacrificing convexity. In this work, we focus specifically on the extreme case of the l0-based reconstruction, a task that is complicated by the discontinuity of the loss. In the first part of the paper, we describe via statistical physics methods, and in particular the replica method, how the solutions to this optimization problem are arranged in a clustered structure. We observe two distinct regimes: one at low compression rate where the signal can be recovered exactly, and one at high compression rate where the signal cannot be recovered accurately. In the second part, we present two message-passing algorithms based on our first results for the l0-norm optimization problem. The proposed algorithms are able to recover the signal at compression rates higher than the ones achieved by LASSO while being computationally efficient. △ Less

Submitted 24 April, 2023; originally announced April 2023.

Journal ref: Proceedings of ITW 2023

arXiv:2303.09995 [pdf, other]

doi 10.1088/2632-2153/ace60f

Neural-prior stochastic block model

Authors: O. Duranthon, L. Zdeborová

Abstract: The stochastic block model (SBM) is widely studied as a benchmark for graph clustering aka community detection. In practice, graph data often come with node attributes that bear additional information about the communities. Previous works modeled such data by considering that the node attributes are generated from the node community memberships. In this work, motivated by a recent surge of works i… ▽ More The stochastic block model (SBM) is widely studied as a benchmark for graph clustering aka community detection. In practice, graph data often come with node attributes that bear additional information about the communities. Previous works modeled such data by considering that the node attributes are generated from the node community memberships. In this work, motivated by a recent surge of works in signal processing using deep neural networks as priors, we propose to model the communities as being determined by the node attributes rather than the opposite. We define the corresponding model; we call it the neural-prior SBM. We propose an algorithm, stemming from statistical physics, based on a combination of belief propagation and approximate message passing. We analyze the performance of the algorithm as well as the Bayes-optimal performance. We identify detectability and exact recovery phase transitions, as well as an algorithmically hard region. The proposed model and algorithm can be used as a benchmark for both theory and algorithms. To illustrate this, we compare the optimal performances to the performance of simple graph neural networks. △ Less

Submitted 6 September, 2023; v1 submitted 17 March, 2023; originally announced March 2023.

Journal ref: Mach. Learn.: Sci. Technol. 4 035017 (2023)

arXiv:2303.05237 [pdf, other]

doi 10.1088/1742-5468/ad1391

Statistical mechanics of the maximum-average submatrix problem

Authors: Vittorio Erba, Florent Krzakala, Rodrigo Pérez, Lenka Zdeborová

Abstract: We study the maximum-average submatrix problem, in which given an $N \times N$ matrix $J$ one needs to find the $k \times k$ submatrix with the largest average of entries. We study the problem for random matrices $J$ whose entries are i.i.d. random variables by map** it to a variant of the Sherrington-Kirkpatrick spin-glass model at fixed magnetization. We characterize analytically the phase dia… ▽ More We study the maximum-average submatrix problem, in which given an $N \times N$ matrix $J$ one needs to find the $k \times k$ submatrix with the largest average of entries. We study the problem for random matrices $J$ whose entries are i.i.d. random variables by map** it to a variant of the Sherrington-Kirkpatrick spin-glass model at fixed magnetization. We characterize analytically the phase diagram of the model as a function of the submatrix average and the size of the submatrix $k$ in the limit $N\to\infty$. We consider submatrices of size $k = m N$ with $0 < m < 1$. We find a rich phase diagram, including dynamical, static one-step replica symmetry breaking and full-step replica symmetry breaking. In the limit of $m \to 0$, we find a simpler phase diagram featuring a frozen 1-RSB phase, where the Gibbs measure is composed of exponentially many pure states each with zero entropy. We discover an interesting phenomenon, reminiscent of the phenomenology of the binary perceptron: there exist efficient algorithms that provably work in the frozen 1-RSB phase. △ Less

Submitted 21 September, 2023; v1 submitted 9 March, 2023; originally announced March 2023.

Journal ref: J. Stat. Mech. (2024) 013403

arXiv:2303.02644 [pdf, other]

Expectation consistency for calibration of neural networks

Authors: Lucas Clarté, Bruno Loureiro, Florent Krzakala, Lenka Zdeborová

Abstract: Despite their incredible performance, it is well reported that deep neural networks tend to be overoptimistic about their prediction confidence. Finding effective and efficient calibration methods for neural networks is therefore an important endeavour towards better uncertainty quantification in deep learning. In this manuscript, we introduce a novel calibration technique named expectation consis… ▽ More Despite their incredible performance, it is well reported that deep neural networks tend to be overoptimistic about their prediction confidence. Finding effective and efficient calibration methods for neural networks is therefore an important endeavour towards better uncertainty quantification in deep learning. In this manuscript, we introduce a novel calibration technique named expectation consistency (EC), consisting of a post-training rescaling of the last layer weights by enforcing that the average validation confidence coincides with the average proportion of correct labels. First, we show that the EC method achieves similar calibration performance to temperature scaling (TS) across different neural network architectures and data sets, all while requiring similar validation samples and computational resources. However, we argue that EC provides a principled method grounded on a Bayesian optimality principle known as the Nishimori identity. Next, we provide an asymptotic characterization of both TS and EC in a synthetic setting and show that their performance crucially depends on the target function. In particular, we discuss examples where EC significantly outperforms TS. △ Less

Submitted 4 August, 2023; v1 submitted 5 March, 2023; originally announced March 2023.

Journal ref: Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, PMLR 216:443-453, 2023

arXiv:2302.00375 [pdf, other]

Bayes-optimal Learning of Deep Random Networks of Extensive-width

Authors: Hugo Cui, Florent Krzakala, Lenka Zdeborová

Abstract: We consider the problem of learning a target function corresponding to a deep, extensive-width, non-linear neural network with random Gaussian weights. We consider the asymptotic limit where the number of samples, the input dimension and the network width are proportionally large. We propose a closed-form expression for the Bayes-optimal test error, for regression and classification tasks. We furt… ▽ More We consider the problem of learning a target function corresponding to a deep, extensive-width, non-linear neural network with random Gaussian weights. We consider the asymptotic limit where the number of samples, the input dimension and the network width are proportionally large. We propose a closed-form expression for the Bayes-optimal test error, for regression and classification tasks. We further compute closed-form expressions for the test errors of ridge regression, kernel and random features regression. We find, in particular, that optimally regularized ridge regression, as well as kernel regression, achieve Bayes-optimal performances, while the logistic loss yields a near-optimal test error for classification. We further show numerically that when the number of samples grows faster than the dimension, ridge and kernel methods become suboptimal, while neural networks achieve test error close to zero from quadratically many samples. △ Less

Submitted 21 June, 2023; v1 submitted 1 February, 2023; originally announced February 2023.

Journal ref: Proceedings of the 40th International Conference on Machine Learning, PMLR 202:6468-6521, 2023

arXiv:2210.12760 [pdf, ps, other]

On double-descent in uncertainty quantification in overparametrized models

Authors: Lucas Clarté, Bruno Loureiro, Florent Krzakala, Lenka Zdeborová

Abstract: Uncertainty quantification is a central challenge in reliable and trustworthy machine learning. Naive measures such as last-layer scores are well-known to yield overconfident estimates in the context of overparametrized neural networks. Several methods, ranging from temperature scaling to different Bayesian treatments of neural networks, have been proposed to mitigate overconfidence, most often su… ▽ More Uncertainty quantification is a central challenge in reliable and trustworthy machine learning. Naive measures such as last-layer scores are well-known to yield overconfident estimates in the context of overparametrized neural networks. Several methods, ranging from temperature scaling to different Bayesian treatments of neural networks, have been proposed to mitigate overconfidence, most often supported by the numerical observation that they yield better calibrated uncertainty measures. In this work, we provide a sharp comparison between popular uncertainty measures for binary classification in a mathematically tractable model for overparametrized neural networks: the random features model. We discuss a trade-off between classification accuracy and calibration, unveiling a double descent like behavior in the calibration curve of optimally regularized estimators as a function of overparametrization. This is in contrast with the empirical Bayes method, which we show to be well calibrated in our setting despite the higher generalization error and overparametrization. △ Less

Submitted 23 May, 2023; v1 submitted 23 October, 2022; originally announced October 2022.

Journal ref: Proceedings of The 26th International Conference on Artificial Intelligence and Statistics (2023), PMLR 206:7089-7125

arXiv:2210.08312 [pdf, other]

doi 10.1088/1742-5468/ac9cc8

Disordered Systems Insights on Computational Hardness

Authors: David Gamarnik, Cristopher Moore, Lenka Zdeborová

Abstract: In this review article, we discuss connections between the physics of disordered systems, phase transitions in inference problems, and computational hardness. We introduce two models representing the behavior of glassy systems, the spiked tensor model and the generalized linear model. We discuss the random (non-planted) versions of these problems as prototypical optimization problems, as well as t… ▽ More In this review article, we discuss connections between the physics of disordered systems, phase transitions in inference problems, and computational hardness. We introduce two models representing the behavior of glassy systems, the spiked tensor model and the generalized linear model. We discuss the random (non-planted) versions of these problems as prototypical optimization problems, as well as the planted versions (with a hidden solution) as prototypical problems in statistical inference and learning. Based on ideas from physics, many of these problems have transitions where they are believed to jump from easy (solvable in polynomial time) to hard (requiring exponential time). We discuss several emerging ideas in theoretical computer science and statistics that provide rigorous evidence for hardness by proving that large classes of algorithms fail in the conjectured hard regime. This includes the overlap gap property, a particular mathematization of clustering or dynamical symmetry-breaking, which can be used to show that many algorithms that are local or robust to changes in their input fail. We also discuss the sum-of-squares hierarchy, which places bounds on proofs or algorithms that use low-degree polynomials such as standard spectral methods and semidefinite relaxations, including the Sherrington-Kirkpatrick model. Throughout the manuscript, we present connections to the physics of disordered systems and associated replica symmetry breaking properties. △ Less

Submitted 18 October, 2022; v1 submitted 15 October, 2022; originally announced October 2022.

Comments: 42 pages

Journal ref: J. Stat. Mech. (2022) 114015

arXiv:2210.06591 [pdf, other]

Rigorous dynamical mean field theory for stochastic gradient descent methods

Authors: Cedric Gerbelot, Emanuele Troiani, Francesca Mignacco, Florent Krzakala, Lenka Zdeborova

Abstract: We prove closed-form equations for the exact high-dimensional asymptotics of a family of first order gradient-based methods, learning an estimator (e.g. M-estimator, shallow neural network, ...) from observations on Gaussian data with empirical risk minimization. This includes widely used algorithms such as stochastic gradient descent (SGD) or Nesterov acceleration. The obtained equations match th… ▽ More We prove closed-form equations for the exact high-dimensional asymptotics of a family of first order gradient-based methods, learning an estimator (e.g. M-estimator, shallow neural network, ...) from observations on Gaussian data with empirical risk minimization. This includes widely used algorithms such as stochastic gradient descent (SGD) or Nesterov acceleration. The obtained equations match those resulting from the discretization of dynamical mean-field theory (DMFT) equations from statistical physics when applied to gradient flow. Our proof method allows us to give an explicit description of how memory kernels build up in the effective dynamics, and to include non-separable update functions, allowing datasets with non-identity covariance matrices. Finally, we provide numerical implementations of the equations for SGD with generic extensive batch-size and with constant learning rates. △ Less

Submitted 29 November, 2023; v1 submitted 12 October, 2022; originally announced October 2022.

Comments: 40 pages, 4 figures

arXiv:2209.03423 [pdf, other]

doi 10.1103/PhysRevE.106.054302

Planted matching problems on random hypergraphs

Authors: Urte Adomaityte, Anshul Toshniwal, Gabriele Sicuro, Lenka Zdeborová

Abstract: We consider the problem of inferring a matching hidden in a weighted random $k$-hypergraph. We assume that the hyperedges' weights are random and distributed according to two different densities conditioning on the fact that they belong to the hidden matching, or not. We show that, for $k>2$ and in the large graph size limit, an algorithmic first order transition in the signal strength separates a… ▽ More We consider the problem of inferring a matching hidden in a weighted random $k$-hypergraph. We assume that the hyperedges' weights are random and distributed according to two different densities conditioning on the fact that they belong to the hidden matching, or not. We show that, for $k>2$ and in the large graph size limit, an algorithmic first order transition in the signal strength separates a regime in which a complete recovery of the hidden matching is feasible from a regime in which partial recovery is possible. This is in contrast to the $k=2$ case where the transition is known to be continuous. Finally, we consider the case of graphs presenting a mixture of edges and $3$-hyperedges, interpolating between the $k=2$ and the $k=3$ cases, and we study how the transition changes from continuous to first order by tuning the relative amount of edges and hyperedges. △ Less

Submitted 7 September, 2022; originally announced September 2022.

Comments: 13 pages, 12 figures

Journal ref: Phys. Rev. E 106, 054302 (2022)

arXiv:2208.06488 [pdf, other]

doi 10.1103/PhysRevE.106.054115

The planted XY model: thermodynamics and inference

Authors: Siyu Chen, Guanhao Huang, Giovanni Piccioli, Lenka Zdeborová

Abstract: In this paper we study a fully connected planted spin glass named the planted XY model. Motivation for studying this system comes both from the spin glass field and the one of statistical inference where it models the angular synchronization problem. We derive the replica symmetric (RS) phase diagram in the temperature, ferromagnetic bias plane using the approximate message passing (AMP) algorithm… ▽ More In this paper we study a fully connected planted spin glass named the planted XY model. Motivation for studying this system comes both from the spin glass field and the one of statistical inference where it models the angular synchronization problem. We derive the replica symmetric (RS) phase diagram in the temperature, ferromagnetic bias plane using the approximate message passing (AMP) algorithm and its state evolution (SE). While the RS predictions are exact on the Nishimori line (i.e. when the temperature is matched to the ferromagnetic bias), they become inaccurate when the parameters are mismatched, giving rise to a spin glass phase where AMP is not able to converge. To overcome the defects of the RS approximation we carry out a one-step replica symmetry breaking (1RSB) analysis based on the approximate survey propagation (ASP) algorithm. Exploiting the state evolution of ASP, we count the number of metastable states in the measure, derive the 1RSB free entropy and find the behavior of the Parisi parameter throughout the spin glass phase. △ Less

Submitted 11 January, 2024; v1 submitted 12 August, 2022; originally announced August 2022.

Comments: 29 pages, 8 figures

Journal ref: Phys. Rev. E 106, 054115 (2022)

arXiv:2205.13527 [pdf, other]

Subspace clustering in high-dimensions: Phase transitions & Statistical-to-Computational gap

Authors: Luca Pesce, Bruno Loureiro, Florent Krzakala, Lenka Zdeborová

Abstract: A simple model to study subspace clustering is the high-dimensional $k$-Gaussian mixture model where the cluster means are sparse vectors. Here we provide an exact asymptotic characterization of the statistically optimal reconstruction error in this model in the high-dimensional regime with extensive sparsity, i.e. when the fraction of non-zero components of the cluster means $ρ$, as well as the r… ▽ More A simple model to study subspace clustering is the high-dimensional $k$-Gaussian mixture model where the cluster means are sparse vectors. Here we provide an exact asymptotic characterization of the statistically optimal reconstruction error in this model in the high-dimensional regime with extensive sparsity, i.e. when the fraction of non-zero components of the cluster means $ρ$, as well as the ratio $α$ between the number of samples and the dimension are fixed, while the dimension diverges. We identify the information-theoretic threshold below which obtaining a positive correlation with the true cluster means is statistically impossible. Additionally, we investigate the performance of the approximate message passing (AMP) algorithm analyzed via its state evolution, which is conjectured to be optimal among polynomial algorithm for this task. We identify in particular the existence of a statistical-to-computational gap between the algorithm that require a signal-to-noise ratio $λ_{\text{alg}} \ge k / \sqrtα $ to perform better than random, and the information theoretic threshold at $λ_{\text{it}} \approx \sqrt{-k ρ\logρ} / \sqrtα$. Finally, we discuss the case of sub-extensive sparsity $ρ$ by comparing the performance of the AMP with other sparsity-enhancing algorithms, such as sparse-PCA and diagonal thresholding. △ Less

Submitted 1 December, 2022; v1 submitted 26 May, 2022; originally announced May 2022.

Comments: NeurIPS camera-ready version

Journal ref: Advances in Neural Information Processing Systems (2022), vol 35, pages 27087--27099

arXiv:2205.13503 [pdf, other]

Multi-layer State Evolution Under Random Convolutional Design

Authors: Max Daniels, Cédric Gerbelot, Florent Krzakala, Lenka Zdeborová

Abstract: Signal recovery under generative neural network priors has emerged as a promising direction in statistical inference and computational imaging. Theoretical analysis of reconstruction algorithms under generative priors is, however, challenging. For generative priors with fully connected layers and Gaussian i.i.d. weights, this was achieved by the multi-layer approximate message (ML-AMP) algorithm v… ▽ More Signal recovery under generative neural network priors has emerged as a promising direction in statistical inference and computational imaging. Theoretical analysis of reconstruction algorithms under generative priors is, however, challenging. For generative priors with fully connected layers and Gaussian i.i.d. weights, this was achieved by the multi-layer approximate message (ML-AMP) algorithm via a rigorous state evolution. However, practical generative priors are typically convolutional, allowing for computational benefits and inductive biases, and so the Gaussian i.i.d. weight assumption is very limiting. In this paper, we overcome this limitation and establish the state evolution of ML-AMP for random convolutional layers. We prove in particular that random convolutional layers belong to the same universality class as Gaussian matrices. Our proof technique is of an independent interest as it establishes a map** between convolutional matrices and spatially coupled sensing matrices used in coding theory. △ Less

Submitted 12 October, 2022; v1 submitted 26 May, 2022; originally announced May 2022.

Comments: Accepted to NeurIPS 2022

Journal ref: Advances in Neural Information Processing Systems (2022), vol 52, pages 7089--7102

arXiv:2205.13303 [pdf, other]

Gaussian Universality of Perceptrons with Random Labels

Authors: Federica Gerace, Florent Krzakala, Bruno Loureiro, Ludovic Stephan, Lenka Zdeborová

Abstract: While classical in many theoretical settings - and in particular in statistical physics-inspired works - the assumption of Gaussian i.i.d. input data is often perceived as a strong limitation in the context of statistics and machine learning. In this study, we redeem this line of work in the case of generalized linear classification, a.k.a. the perceptron model, with random labels. We argue that t… ▽ More While classical in many theoretical settings - and in particular in statistical physics-inspired works - the assumption of Gaussian i.i.d. input data is often perceived as a strong limitation in the context of statistics and machine learning. In this study, we redeem this line of work in the case of generalized linear classification, a.k.a. the perceptron model, with random labels. We argue that there is a large universality class of high-dimensional input data for which we obtain the same minimum training loss as for Gaussian data with corresponding data covariance. In the limit of vanishing regularization, we further demonstrate that the training loss is independent of the data covariance. On the theoretical side, we prove this universality for an arbitrary mixture of homogeneous Gaussian clouds. Empirically, we show that the universality holds also for a broad range of real datasets. △ Less

Submitted 2 March, 2023; v1 submitted 26 May, 2022; originally announced May 2022.

arXiv:2203.12094 [pdf, other]

doi 10.1088/2632-2153/acb428

Learning curves for the multi-class teacher-student perceptron

Authors: Elisabetta Cornacchia, Francesca Mignacco, Rodrigo Veiga, Cédric Gerbelot, Bruno Loureiro, Lenka Zdeborová

Abstract: One of the most classical results in high-dimensional learning theory provides a closed-form expression for the generalisation error of binary classification with the single-layer teacher-student perceptron on i.i.d. Gaussian inputs. Both Bayes-optimal estimation and empirical risk minimisation (ERM) were extensively analysed for this setting. At the same time, a considerable part of modern machin… ▽ More One of the most classical results in high-dimensional learning theory provides a closed-form expression for the generalisation error of binary classification with the single-layer teacher-student perceptron on i.i.d. Gaussian inputs. Both Bayes-optimal estimation and empirical risk minimisation (ERM) were extensively analysed for this setting. At the same time, a considerable part of modern machine learning practice concerns multi-class classification. Yet, an analogous analysis for the corresponding multi-class teacher-student perceptron was missing. In this manuscript we fill this gap by deriving and evaluating asymptotic expressions for both the Bayes-optimal and ERM generalisation errors in the high-dimensional regime. For Gaussian teacher weights, we investigate the performance of ERM with both cross-entropy and square losses, and explore the role of ridge regularisation in approaching Bayes-optimality. In particular, we observe that regularised cross-entropy minimisation yields close-to-optimal accuracy. Instead, for a binary teacher we show that a first-order phase transition arises in the Bayes-optimal performance. △ Less

Submitted 22 March, 2022; originally announced March 2022.

Comments: 14 pages + appendix

Journal ref: Machine Learning: Science and Technology 4 015019 (2022)

arXiv:2203.07752 [pdf, other]

Optimal denoising of rotationally invariant rectangular matrices

Authors: Emanuele Troiani, Vittorio Erba, Florent Krzakala, Antoine Maillard, Lenka Zdeborová

Abstract: In this manuscript we consider denoising of large rectangular matrices: given a noisy observation of a signal matrix, what is the best way of recovering the signal matrix itself? For Gaussian noise and rotationally-invariant signal priors, we completely characterize the optimal denoiser and its performance in the high-dimensional limit, in which the size of the signal matrix goes to infinity with… ▽ More In this manuscript we consider denoising of large rectangular matrices: given a noisy observation of a signal matrix, what is the best way of recovering the signal matrix itself? For Gaussian noise and rotationally-invariant signal priors, we completely characterize the optimal denoiser and its performance in the high-dimensional limit, in which the size of the signal matrix goes to infinity with fixed aspects ratio, and under the Bayes optimal setting, that is when the statistician knows how the signal and the observations were generated. Our results generalise previous works that considered only symmetric matrices to the more general case of non-symmetric and rectangular ones. We explore analytically and numerically a particular choice of factorized signal prior that models cross-covariance matrices and the matrix factorization problem. As a byproduct of our analysis, we provide an explicit asymptotic evaluation of the rectangular Harish-Chandra-Itzykson-Zuber integral in a special case. △ Less

Submitted 15 March, 2022; originally announced March 2022.

Journal ref: Proceedings of Mathematical and Scientific Machine Learning (MSML), PMLR 190:97-112, 2022

arXiv:2202.10379 [pdf, other]

doi 10.1088/1751-8121/ac8b46

(Dis)assortative Partitions on Random Regular Graphs

Authors: Freya Behrens, Gabriel Arpino, Yaroslav Kivva, Lenka Zdeborová

Abstract: We study the problem of assortative and disassortative partitions on random $d$-regular graphs. Nodes in the graph are partitioned into two non-empty groups. In the assortative partition every node requires at least $H$ of their neighbors to be in their own group. In the disassortative partition they require less than $H$ neighbors to be in their own group. Using the cavity method based on analysi… ▽ More We study the problem of assortative and disassortative partitions on random $d$-regular graphs. Nodes in the graph are partitioned into two non-empty groups. In the assortative partition every node requires at least $H$ of their neighbors to be in their own group. In the disassortative partition they require less than $H$ neighbors to be in their own group. Using the cavity method based on analysis of the Belief Propagation algorithm we establish for which combinations of parameters $(d,H)$ these partitions exist with high probability and for which they do not. For $H>\lceil \frac{d}{2} \rceil $ we establish that the structure of solutions to the assortative partition problems corresponds to the so-called frozen-1RSB. This entails a conjecture of algorithmic hardness of finding these partitions efficiently. For $H \le \lceil \frac{d}{2} \rceil $ we argue that the assortative partition problem is algorithmically easy on average for all $d$. Further we provide arguments about asymptotic equivalence between the assortative partition problem and the disassortative one, going trough a close relation to the problem of single-spin-flip-stable states in spin glasses. In the context of spin glasses, our results on algorithmic hardness imply a conjecture that gapped single spin flip stable states are hard to find which may be a universal reason behind the observation that physical dynamics in glassy systems display convergence to marginal stability. △ Less

Submitted 2 May, 2022; v1 submitted 21 February, 2022; originally announced February 2022.

Comments: 21 pages; Corrected usage of the world "planted" in Section 4

Journal ref: J. Phys. A: Math. Theor. 55 395004 (2022)

arXiv:2202.03295 [pdf, other]

doi 10.1088/2632-2153/acd749

Theoretical characterization of uncertainty in high-dimensional linear classification

Authors: Lucas Clarté, Bruno Loureiro, Florent Krzakala, Lenka Zdeborová

Abstract: Being able to reliably assess not only the \emph{accuracy} but also the \emph{uncertainty} of models' predictions is an important endeavour in modern machine learning. Even if the model generating the data and labels is known, computing the intrinsic uncertainty after learning the model from a limited number of samples amounts to sampling the corresponding posterior probability measure. Such sampl… ▽ More Being able to reliably assess not only the \emph{accuracy} but also the \emph{uncertainty} of models' predictions is an important endeavour in modern machine learning. Even if the model generating the data and labels is known, computing the intrinsic uncertainty after learning the model from a limited number of samples amounts to sampling the corresponding posterior probability measure. Such sampling is computationally challenging in high-dimensional problems and theoretical results on heuristic uncertainty estimators in high-dimensions are thus scarce. In this manuscript, we characterise uncertainty for learning from limited number of samples of high-dimensional Gaussian input data and labels generated by the probit model. In this setting, the Bayesian uncertainty (i.e. the posterior marginals) can be asymptotically obtained by the approximate message passing algorithm, bypassing the canonical but costly Monte Carlo sampling of the posterior. We then provide a closed-form formula for the joint statistics between the logistic classifier, the uncertainty of the statistically optimal Bayesian classifier and the ground-truth probit uncertainty. The formula allows us to investigate calibration of the logistic classifier learning from limited amount of samples. We discuss how over-confidence can be mitigated by appropriately regularising. △ Less

Submitted 14 November, 2022; v1 submitted 7 February, 2022; originally announced February 2022.

Journal ref: Mach. Learn.: Sci. Technol. 4 025029 (2023)

arXiv:2202.00293 [pdf, other]

Phase diagram of Stochastic Gradient Descent in high-dimensional two-layer neural networks

Authors: Rodrigo Veiga, Ludovic Stephan, Bruno Loureiro, Florent Krzakala, Lenka Zdeborová

Abstract: Despite the non-convex optimization landscape, over-parametrized shallow networks are able to achieve global convergence under gradient descent. The picture can be radically different for narrow networks, which tend to get stuck in badly-generalizing local minima. Here we investigate the cross-over between these two regimes in the high-dimensional setting, and in particular investigate the connect… ▽ More Despite the non-convex optimization landscape, over-parametrized shallow networks are able to achieve global convergence under gradient descent. The picture can be radically different for narrow networks, which tend to get stuck in badly-generalizing local minima. Here we investigate the cross-over between these two regimes in the high-dimensional setting, and in particular investigate the connection between the so-called mean-field/hydrodynamic regime and the seminal approach of Saad & Solla. Focusing on the case of Gaussian data, we study the interplay between the learning rate, the time scale, and the number of hidden units in the high-dimensional dynamics of stochastic gradient descent (SGD). Our work builds on a deterministic description of SGD in high-dimensions from statistical physics, which we extend and for which we provide rigorous convergence rates. △ Less

Submitted 14 June, 2023; v1 submitted 1 February, 2022; originally announced February 2022.

Comments: 20 pages

Journal ref: Advances in Neural Information Processing Systems (2022), vol 35, pages {23244--23255)

arXiv:2201.12655 [pdf, other]

doi 10.1088/2632-2153/acf041

Error Scaling Laws for Kernel Classification under Source and Capacity Conditions

Authors: Hugo Cui, Bruno Loureiro, Florent Krzakala, Lenka Zdeborová

Abstract: We consider the problem of kernel classification. While worst-case bounds on the decay rate of the prediction error with the number of samples are known for some classifiers, they often fail to accurately describe the learning curves of real data sets. In this work, we consider the important class of data sets satisfying the standard source and capacity conditions, comprising a number of real data… ▽ More We consider the problem of kernel classification. While worst-case bounds on the decay rate of the prediction error with the number of samples are known for some classifiers, they often fail to accurately describe the learning curves of real data sets. In this work, we consider the important class of data sets satisfying the standard source and capacity conditions, comprising a number of real data sets as we show numerically. Under the Gaussian design, we derive the decay rates for the misclassification (prediction) error as a function of the source and capacity coefficients. We do so for two standard kernel classification settings, namely margin-maximizing Support Vector Machines (SVM) and ridge classification, and contrast the two methods. We find that our rates tightly describe the learning curves for this class of data sets, and are also observed on real data. Our results can also be seen as an explicit prediction of the exponents of a scaling law for kernel classification that is accurate on some real datasets. △ Less

Submitted 6 September, 2023; v1 submitted 29 January, 2022; originally announced January 2022.

Journal ref: Mach. Learn.: Sci. Technol. (2023) 4 035033

arXiv:2112.13079 [pdf, other]

doi 10.1088/1742-5468/ac70d2

Aligning random graphs with a sub-tree similarity message-passing algorithm

Authors: Giovanni Piccioli, Guilhem Semerjian, Gabriele Sicuro, Lenka Zdeborová

Abstract: The problem of aligning Erdös-Rényi random graphs is a noisy, average-case version of the graph isomorphism problem, in which a pair of correlated random graphs is observed through a random permutation of their vertices. We study a polynomial time message-passing algorithm devised to solve the inference problem of partially recovering the hidden permutation, in the sparse regime with constant aver… ▽ More The problem of aligning Erdös-Rényi random graphs is a noisy, average-case version of the graph isomorphism problem, in which a pair of correlated random graphs is observed through a random permutation of their vertices. We study a polynomial time message-passing algorithm devised to solve the inference problem of partially recovering the hidden permutation, in the sparse regime with constant average degrees. We perform extensive numerical simulations to determine the range of parameters in which this algorithm achieves partial recovery. We also introduce a generalized ensemble of correlated random graphs with prescribed degree distributions, and extend the algorithm to this case. △ Less

Submitted 4 May, 2022; v1 submitted 24 December, 2021; originally announced December 2021.

Comments: 36 pages, 14 figures, submitted to Journal of Statistical Mechanics: Theory and Experiment. Corrected typos. Modified Figure 1 for clarity. Added references' titles in bibliography. Added definition of "quasi-aligned". Added clarifications about the significance of Nishimori experiments

Journal ref: J. Stat. Mech. (2022) 063401

arXiv:2110.08775 [pdf, other]

doi 10.1088/1742-5468/ac7e4c

Perturbative construction of mean-field equations in extensive-rank matrix factorization and denoising

Authors: Antoine Maillard, Florent Krzakala, Marc Mézard, Lenka Zdeborová

Abstract: Factorization of matrices where the rank of the two factors diverges linearly with their sizes has many applications in diverse areas such as unsupervised representation learning, dictionary learning or sparse coding. We consider a setting where the two factors are generated from known component-wise independent prior distributions, and the statistician observes a (possibly noisy) component-wise f… ▽ More Factorization of matrices where the rank of the two factors diverges linearly with their sizes has many applications in diverse areas such as unsupervised representation learning, dictionary learning or sparse coding. We consider a setting where the two factors are generated from known component-wise independent prior distributions, and the statistician observes a (possibly noisy) component-wise function of their matrix product. In the limit where the dimensions of the matrices tend to infinity, but their ratios remain fixed, we expect to be able to derive closed form expressions for the optimal mean squared error on the estimation of the two factors. However, this remains a very involved mathematical and algorithmic problem. A related, but simpler, problem is extensive-rank matrix denoising, where one aims to reconstruct a matrix with extensive but usually small rank from noisy measurements. In this paper, we approach both these problems using high-temperature expansions at fixed order parameters. This allows to clarify how previous attempts at solving these problems failed at finding an asymptotically exact solution. We provide a systematic way to derive the corrections to these existing approximations, taking into account the structure of correlations particular to the problem. Finally, we illustrate our approach in detail on the case of extensive-rank matrix denoising. We compare our results with known optimal rotationally-invariant estimators, and show how exact asymptotic calculations of the minimal error can be performed using extensive-rank matrix integrals. △ Less

Submitted 8 June, 2022; v1 submitted 17 October, 2021; originally announced October 2021.

Comments: 30 pages (main text), 25 pages of references and appendices. v2: Adding clarifications and a new result to derive the optimal denoising estimator from the asymptotic free energy. v3: corrections to match the published version

Journal ref: J. Stat. Mech. (2022) 083301

arXiv:2106.05418 [pdf, other]

doi 10.1088/2632-2153/ac4f3f

Probing transfer learning with a model of synthetic correlated datasets

Authors: Federica Gerace, Luca Saglietti, Stefano Sarao Mannelli, Andrew Saxe, Lenka Zdeborová

Abstract: Transfer learning can significantly improve the sample efficiency of neural networks, by exploiting the relatedness between a data-scarce target task and a data-abundant source task. Despite years of successful applications, transfer learning practice often relies on ad-hoc solutions, while theoretical understanding of these procedures is still limited. In the present work, we re-think a solvable… ▽ More Transfer learning can significantly improve the sample efficiency of neural networks, by exploiting the relatedness between a data-scarce target task and a data-abundant source task. Despite years of successful applications, transfer learning practice often relies on ad-hoc solutions, while theoretical understanding of these procedures is still limited. In the present work, we re-think a solvable model of synthetic data as a framework for modeling correlation between data-sets. This setup allows for an analytic characterization of the generalization performance obtained when transferring the learned feature map from the source to the target task. Focusing on the problem of training two-layer networks in a binary classification setting, we show that our model can capture a range of salient features of transfer learning with real data. Moreover, by exploiting parametric control over the correlation between the two data-sets, we systematically investigate under which conditions the transfer of features is beneficial for generalization. △ Less

Submitted 2 February, 2022; v1 submitted 9 June, 2021; originally announced June 2021.

Journal ref: Machine Learning: Science and Technology 3.1 (2022): 015030

arXiv:2106.03791 [pdf, other]

Learning Gaussian Mixtures with Generalised Linear Models: Precise Asymptotics in High-dimensions

Authors: Bruno Loureiro, Gabriele Sicuro, Cédric Gerbelot, Alessandro Pacco, Florent Krzakala, Lenka Zdeborová

Abstract: Generalised linear models for multi-class classification problems are one of the fundamental building blocks of modern machine learning tasks. In this manuscript, we characterise the learning of a mixture of $K$ Gaussians with generic means and covariances via empirical risk minimisation (ERM) with any convex loss and regularisation. In particular, we prove exact asymptotics characterising the ERM… ▽ More Generalised linear models for multi-class classification problems are one of the fundamental building blocks of modern machine learning tasks. In this manuscript, we characterise the learning of a mixture of $K$ Gaussians with generic means and covariances via empirical risk minimisation (ERM) with any convex loss and regularisation. In particular, we prove exact asymptotics characterising the ERM estimator in high-dimensions, extending several previous results about Gaussian mixture classification in the literature. We exemplify our result in two tasks of interest in statistical learning: a) classification for a mixture with sparse means, where we study the efficiency of $\ell_1$ penalty with respect to $\ell_2$; b) max-margin multi-class classification, where we characterise the phase transition on the existence of the multi-class logistic maximum likelihood estimator for $K>2$. Finally, we discuss how our theory can be applied beyond the scope of synthetic data, showing that in different cases Gaussian mixtures capture closely the learning curve of classification tasks in real data sets. △ Less

Submitted 14 December, 2021; v1 submitted 7 June, 2021; originally announced June 2021.

Comments: 12 pages + 34 pages of Appendix, 10 figures

Journal ref: Advances in Neural Information Processing Systems 34 (2021): 10144-10157

arXiv:2105.15004 [pdf, other]

doi 10.1088/1742-5468/ac9829

Generalization Error Rates in Kernel Regression: The Crossover from the Noiseless to Noisy Regime

Authors: Hugo Cui, Bruno Loureiro, Florent Krzakala, Lenka Zdeborová

Abstract: In this manuscript we consider Kernel Ridge Regression (KRR) under the Gaussian design. Exponents for the decay of the excess generalization error of KRR have been reported in various works under the assumption of power-law decay of eigenvalues of the features co-variance. These decays were, however, provided for sizeably different setups, namely in the noiseless case with constant regularization… ▽ More In this manuscript we consider Kernel Ridge Regression (KRR) under the Gaussian design. Exponents for the decay of the excess generalization error of KRR have been reported in various works under the assumption of power-law decay of eigenvalues of the features co-variance. These decays were, however, provided for sizeably different setups, namely in the noiseless case with constant regularization and in the noisy optimally regularized case. Intermediary settings have been left substantially uncharted. In this work, we unify and extend this line of work, providing characterization of all regimes and excess error decay rates that can be observed in terms of the interplay of noise and regularization. In particular, we show the existence of a transition in the noisy setting between the noiseless exponents to its noisy values as the sample complexity is increased. Finally, we illustrate how this crossover can also be observed on real data sets. △ Less

Submitted 15 December, 2021; v1 submitted 31 May, 2021; originally announced May 2021.

Comments: 22 pages, 10 figures, 2 tables

Journal ref: 35th Conference on Neural Information Processing Systems (NeurIPS 2021) vol 34 p10131--10143. J. Stat. Mech. (2022) 114004

arXiv:2103.04902 [pdf, other]

doi 10.1088/2632-2153/ac0615

Stochasticity helps to navigate rough landscapes: comparing gradient-descent-based algorithms in the phase retrieval problem

Authors: Francesca Mignacco, Pierfrancesco Urbani, Lenka Zdeborová

Abstract: In this paper we investigate how gradient-based algorithms such as gradient descent, (multi-pass) stochastic gradient descent, its persistent variant, and the Langevin algorithm navigate non-convex loss-landscapes and which of them is able to reach the best generalization error at limited sample complexity. We consider the loss landscape of the high-dimensional phase retrieval problem as a prototy… ▽ More In this paper we investigate how gradient-based algorithms such as gradient descent, (multi-pass) stochastic gradient descent, its persistent variant, and the Langevin algorithm navigate non-convex loss-landscapes and which of them is able to reach the best generalization error at limited sample complexity. We consider the loss landscape of the high-dimensional phase retrieval problem as a prototypical highly non-convex example. We observe that for phase retrieval the stochastic variants of gradient descent are able to reach perfect generalization for regions of control parameters where the gradient descent algorithm is not. We apply dynamical mean-field theory from statistical physics to characterize analytically the full trajectories of these algorithms in their continuous-time limit, with a warm start, and for large system sizes. We further unveil several intriguing properties of the landscape and the algorithms such as that the gradient descent can obtain better generalization properties from less informed initializations. △ Less

Submitted 13 April, 2021; v1 submitted 8 March, 2021; originally announced March 2021.

Comments: 28 pages, 11 figures

Journal ref: Mach. Learn.: Sci. Technol. 2 035029 (2021)

arXiv:2102.11742 [pdf, other]

Classifying high-dimensional Gaussian mixtures: Where kernel methods fail and neural networks succeed

Authors: Maria Refinetti, Sebastian Goldt, Florent Krzakala, Lenka Zdeborová

Abstract: A recent series of theoretical works showed that the dynamics of neural networks with a certain initialisation are well-captured by kernel methods. Concurrent empirical work demonstrated that kernel methods can come close to the performance of neural networks on some image classification tasks. These results raise the question of whether neural networks only learn successfully if kernels also lear… ▽ More A recent series of theoretical works showed that the dynamics of neural networks with a certain initialisation are well-captured by kernel methods. Concurrent empirical work demonstrated that kernel methods can come close to the performance of neural networks on some image classification tasks. These results raise the question of whether neural networks only learn successfully if kernels also learn successfully, despite neural networks being more expressive. Here, we show theoretically that two-layer neural networks (2LNN) with only a few hidden neurons can beat the performance of kernel learning on a simple Gaussian mixture classification task. We study the high-dimensional limit where the number of samples is linearly proportional to the input dimension, and show that while small 2LNN achieve near-optimal performance on this task, lazy training approaches such as random features and kernel methods do not. Our analysis is based on the derivation of a closed set of equations that track the learning dynamics of the 2LNN and thus allow to extract the asymptotic performance of the network as a function of signal-to-noise ratio and other hyperparameters. We finally illustrate how over-parametrising the neural network leads to faster convergence, but does not improve its final performance. △ Less

Submitted 10 June, 2021; v1 submitted 23 February, 2021; originally announced February 2021.

Comments: The accompanying code for this paper is available at https://github.com/mariaref/rfvs2lnn_GMM_online

Journal ref: Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021

arXiv:2102.08127 [pdf, other]

doi 10.1088/1742-5468/ac9825

Learning curves of generic features maps for realistic datasets with a teacher-student model

Authors: Bruno Loureiro, Cédric Gerbelot, Hugo Cui, Sebastian Goldt, Florent Krzakala, Marc Mézard, Lenka Zdeborová

Abstract: Teacher-student models provide a framework in which the typical-case performance of high-dimensional supervised learning can be described in closed form. The assumptions of Gaussian i.i.d. input data underlying the canonical teacher-student model may, however, be perceived as too restrictive to capture the behaviour of realistic data sets. In this paper, we introduce a Gaussian covariate generalis… ▽ More Teacher-student models provide a framework in which the typical-case performance of high-dimensional supervised learning can be described in closed form. The assumptions of Gaussian i.i.d. input data underlying the canonical teacher-student model may, however, be perceived as too restrictive to capture the behaviour of realistic data sets. In this paper, we introduce a Gaussian covariate generalisation of the model where the teacher and student can act on different spaces, generated with fixed, but generic feature maps. While still solvable in a closed form, this generalization is able to capture the learning curves for a broad range of realistic data sets, thus redeeming the potential of the teacher-student framework. Our contribution is then two-fold: First, we prove a rigorous formula for the asymptotic training loss and generalisation error. Second, we present a number of situations where the learning curve of the model captures the one of a realistic data set learned with kernel regression and classification, with out-of-the-box feature maps such as random projections or scattering transforms, or with pre-learned ones - such as the features learned by training multi-layer neural networks. We discuss both the power and the limitations of the framework. △ Less

Submitted 14 December, 2021; v1 submitted 16 February, 2021; originally announced February 2021.

Comments: v3: NeurIPS camera-ready

Journal ref: 35th Conference on Neural Information Processing Systems (NeurIPS 2021), vol 34 p10137--18151. J. Stat. Mech. (2022) 114001

arXiv:2012.04524 [pdf, other]

Construction of optimal spectral methods in phase retrieval

Authors: Antoine Maillard, Florent Krzakala, Yue M. Lu, Lenka Zdeborová

Abstract: We consider the phase retrieval problem, in which the observer wishes to recover a $n$-dimensional real or complex signal $\mathbf{X}^\star$ from the (possibly noisy) observation of $|\mathbfΦ \mathbf{X}^\star|$, in which $\mathbfΦ$ is a matrix of size $m \times n$. We consider a \emph{high-dimensional} setting where $n,m \to \infty$ with $m/n = \mathcal{O}(1)$, and a large class of (possibly corr… ▽ More We consider the phase retrieval problem, in which the observer wishes to recover a $n$-dimensional real or complex signal $\mathbf{X}^\star$ from the (possibly noisy) observation of $|\mathbfΦ \mathbf{X}^\star|$, in which $\mathbfΦ$ is a matrix of size $m \times n$. We consider a \emph{high-dimensional} setting where $n,m \to \infty$ with $m/n = \mathcal{O}(1)$, and a large class of (possibly correlated) random matrices $\mathbfΦ$ and observation channels. Spectral methods are a powerful tool to obtain approximate observations of the signal $\mathbf{X}^\star$ which can be then used as initialization for a subsequent algorithm, at a low computational cost. In this paper, we extend and unify previous results and approaches on spectral methods for the phase retrieval problem. More precisely, we combine the linearization of message-passing algorithms and the analysis of the \emph{Bethe Hessian}, a classical tool of statistical physics. Using this toolbox, we show how to derive optimal spectral methods for arbitrary channel noise and right-unitarily invariant matrix $\mathbfΦ$, in an automated manner (i.e. with no optimization over any hyperparameter or preprocessing function). △ Less

Submitted 14 October, 2021; v1 submitted 8 December, 2020; originally announced December 2020.

Comments: 14 pages + references and appendix. v2: Version updated to match the one accepted at MSML 2021. v3: Adding a reference to a previous work mentioning marginal stability and its connection to Bayes-optimality

Journal ref: Proceedings of Machine Learning Research vol 145:1-28, 2021 2nd Annual Conference on Mathematical and Scientific Machine Learning (MSML 21)

arXiv:2012.00194 [pdf, ps, other]

Solvable Model for Inheriting the Regularization through Knowledge Distillation

Authors: Luca Saglietti, Lenka Zdeborová

Abstract: In recent years the empirical success of transfer learning with neural networks has stimulated an increasing interest in obtaining a theoretical understanding of its core properties. Knowledge distillation where a smaller neural network is trained using the outputs of a larger neural network is a particularly interesting case of transfer learning. In the present work, we introduce a statistical ph… ▽ More In recent years the empirical success of transfer learning with neural networks has stimulated an increasing interest in obtaining a theoretical understanding of its core properties. Knowledge distillation where a smaller neural network is trained using the outputs of a larger neural network is a particularly interesting case of transfer learning. In the present work, we introduce a statistical physics framework that allows an analytic characterization of the properties of knowledge distillation (KD) in shallow neural networks. Focusing the analysis on a solvable model that exhibits a non-trivial generalization gap, we investigate the effectiveness of KD. We are able to show that, through KD, the regularization properties of the larger teacher model can be inherited by the smaller student and that the yielded generalization performance is closely linked to and limited by the optimality of the teacher. Finally, we analyze the double descent phenomenology that can arise in the considered KD setting. △ Less

Submitted 2 December, 2020; v1 submitted 30 November, 2020; originally announced December 2020.

Journal ref: Proceedings of the 2nd Mathematical and Scientific Machine Learning Conference, PMLR 145:809-846, 2022

arXiv:2010.13700 [pdf, other]

doi 10.1088/1751-8121/abee9d

The planted $k$-factor problem

Authors: Gabriele Sicuro, Lenka Zdeborová

Abstract: We consider the problem of recovering an unknown $k$-factor, hidden in a weighted random graph. For $k=1$ this is the planted matching problem, while the $k=2$ case is closely related to the planted travelling salesman problem. The inference problem is solved by exploiting the information arising from the use of two different distributions for the weights on the edges inside and outside the plante… ▽ More We consider the problem of recovering an unknown $k$-factor, hidden in a weighted random graph. For $k=1$ this is the planted matching problem, while the $k=2$ case is closely related to the planted travelling salesman problem. The inference problem is solved by exploiting the information arising from the use of two different distributions for the weights on the edges inside and outside the planted sub-graph. We argue that, in the large size limit, a phase transition can appear between a full and a partial recovery phase as function of the signal-to-noise ratio. We give a criterion for the location of the transition. △ Less

Submitted 8 April, 2021; v1 submitted 26 October, 2020; originally announced October 2020.

Comments: 21 pages, 4 figures

Journal ref: J. Phys. A: Math. Theor. 54 175002 (2021)

arXiv:2009.09422 [pdf, other]

doi 10.1073/pnas.2106548118

Epidemic mitigation by statistical inference from contact tracing data

Authors: Antoine Baker, Indaco Biazzo, Alfredo Braunstein, Giovanni Catania, Luca Dall'Asta, Alessandro Ingrosso, Florent Krzakala, Fabio Mazza, Marc Mézard, Anna Paola Muntoni, Maria Refinetti, Stefano Sarao Mannelli, Lenka Zdeborová

Abstract: Contact-tracing is an essential tool in order to mitigate the impact of pandemic such as the COVID-19. In order to achieve efficient and scalable contact-tracing in real time, digital devices can play an important role. While a lot of attention has been paid to analyzing the privacy and ethical risks of the associated mobile applications, so far much less research has been devoted to optimizing th… ▽ More Contact-tracing is an essential tool in order to mitigate the impact of pandemic such as the COVID-19. In order to achieve efficient and scalable contact-tracing in real time, digital devices can play an important role. While a lot of attention has been paid to analyzing the privacy and ethical risks of the associated mobile applications, so far much less research has been devoted to optimizing their performance and assessing their impact on the mitigation of the epidemic. We develop Bayesian inference methods to estimate the risk that an individual is infected. This inference is based on the list of his recent contacts and their own risk levels, as well as personal information such as results of tests or presence of syndromes. We propose to use probabilistic risk estimation in order to optimize testing and quarantining strategies for the control of an epidemic. Our results show that in some range of epidemic spreading (typically when the manual tracing of all contacts of infected people becomes practically impossible, but before the fraction of infected people reaches the scale where a lock-down becomes unavoidable), this inference of individuals at risk could be an efficient way to mitigate the epidemic. Our approaches translate into fully distributed algorithms that only require communication between individuals who have recently been in contact. Such communication may be encrypted and anonymized and thus compatible with privacy preserving standards. We conclude that probabilistic risk estimation is capable to enhance performance of digital contact tracing and should be considered in the currently developed mobile applications. △ Less

Submitted 20 September, 2020; originally announced September 2020.

Comments: 21 pages, 7 figures

ACM Class: G.3; G.4; I.2.11; J.3

Journal ref: PNAS 2021 Vol. 118 No. 32 e2106548118

arXiv:2006.15459 [pdf, other]

Optimization and Generalization of Shallow Neural Networks with Quadratic Activation Functions

Authors: Stefano Sarao Mannelli, Eric Vanden-Eijnden, Lenka Zdeborová

Abstract: We study the dynamics of optimization and the generalization properties of one-hidden layer neural networks with quadratic activation function in the over-parametrized regime where the layer width $m$ is larger than the input dimension $d$. We consider a teacher-student scenario where the teacher has the same structure as the student with a hidden layer of smaller width $m^*\le m$. We describe… ▽ More We study the dynamics of optimization and the generalization properties of one-hidden layer neural networks with quadratic activation function in the over-parametrized regime where the layer width $m$ is larger than the input dimension $d$. We consider a teacher-student scenario where the teacher has the same structure as the student with a hidden layer of smaller width $m^*\le m$. We describe how the empirical loss landscape is affected by the number $n$ of data samples and the width $m^*$ of the teacher network. In particular we determine how the probability that there be no spurious minima on the empirical loss depends on $n$, $d$, and $m^*$, thereby establishing conditions under which the neural network can in principle recover the teacher. We also show that under the same conditions gradient descent dynamics on the empirical loss converges and leads to small generalization error, i.e. it enables recovery in practice. Finally we characterize the time-convergence rate of gradient descent in the limit of a large number of samples. These results are confirmed by numerical experiments. △ Less

Submitted 18 August, 2020; v1 submitted 27 June, 2020; originally announced June 2020.

Comments: 10 pages, 4 figures + appendix

Journal ref: Advances in Neural Information Processing Systems, v33, page 13445--13455, 2020

arXiv:2006.14709 [pdf, other]

The Gaussian equivalence of generative models for learning with shallow neural networks

Authors: Sebastian Goldt, Bruno Loureiro, Galen Reeves, Florent Krzakala, Marc Mézard, Lenka Zdeborová

Abstract: Understanding the impact of data structure on the computational tractability of learning is a key challenge for the theory of neural networks. Many theoretical works do not explicitly model training data, or assume that inputs are drawn component-wise independently from some simple probability distribution. Here, we go beyond this simple paradigm by studying the performance of neural networks trai… ▽ More Understanding the impact of data structure on the computational tractability of learning is a key challenge for the theory of neural networks. Many theoretical works do not explicitly model training data, or assume that inputs are drawn component-wise independently from some simple probability distribution. Here, we go beyond this simple paradigm by studying the performance of neural networks trained on data drawn from pre-trained generative models. This is possible due to a Gaussian equivalence stating that the key metrics of interest, such as the training and test errors, can be fully captured by an appropriately chosen Gaussian model. We provide three strands of rigorous, analytical and numerical evidence corroborating this equivalence. First, we establish rigorous conditions for the Gaussian equivalence to hold in the case of single-layer generative models, as well as deterministic rates for convergence in distribution. Second, we leverage this equivalence to derive a closed set of equations describing the generalisation performance of two widely studied machine learning problems: two-layer neural networks trained using one-pass stochastic gradient descent, and full-batch pre-learned features or kernel methods. Finally, we perform experiments demonstrating how our theory applies to deep, pre-trained generative models. These results open a viable path to the theoretical study of machine learning models with realistic data. △ Less

Submitted 21 May, 2021; v1 submitted 25 June, 2020; originally announced June 2020.

Comments: The accompanying code for this paper is available at https://github.com/sgoldt/gaussian-equiv-2layer

Journal ref: Proceedings of the 2nd Mathematical and Scientific Machine Learning Conference, PMLR 145:426-471 (2021)

arXiv:2006.06997 [pdf, other]

Complex Dynamics in Simple Neural Networks: Understanding Gradient Flow in Phase Retrieval

Authors: Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Florent Krzakala, Pierfrancesco Urbani, Lenka Zdeborová

Abstract: Despite the widespread use of gradient-based algorithms for optimizing high-dimensional non-convex functions, understanding their ability of finding good minima instead of being trapped in spurious ones remains to a large extent an open problem. Here we focus on gradient flow dynamics for phase retrieval from random measurements. When the ratio of the number of measurements over the input dimensio… ▽ More Despite the widespread use of gradient-based algorithms for optimizing high-dimensional non-convex functions, understanding their ability of finding good minima instead of being trapped in spurious ones remains to a large extent an open problem. Here we focus on gradient flow dynamics for phase retrieval from random measurements. When the ratio of the number of measurements over the input dimension is small the dynamics remains trapped in spurious minima with large basins of attraction. We find analytically that above a critical ratio those critical points become unstable develo** a negative direction toward the signal. By numerical experiments we show that in this regime the gradient flow algorithm is not trapped; it drifts away from the spurious critical points along the unstable direction and succeeds in finding the global minimum. Using tools from statistical physics we characterize this phenomenon, which is related to a BBP-type transition in the Hessian of the spurious minima. △ Less

Submitted 12 June, 2020; originally announced June 2020.

Comments: 9 pages, 5 figures + appendix

Journal ref: Advances in Neural Information Processing Systems, v22, page 3265--327, 2020

arXiv:2006.06560 [pdf, other]

Generalization error in high-dimensional perceptrons: Approaching Bayes error with convex optimization

Authors: Benjamin Aubin, Florent Krzakala, Yue M. Lu, Lenka Zdeborová

Abstract: We consider a commonly studied supervised classification of a synthetic dataset whose labels are generated by feeding a one-layer neural network with random iid inputs. We study the generalization performances of standard classifiers in the high-dimensional regime where $α=n/d$ is kept finite in the limit of a high dimension $d$ and number of samples $n$. Our contribution is three-fold: First, we… ▽ More We consider a commonly studied supervised classification of a synthetic dataset whose labels are generated by feeding a one-layer neural network with random iid inputs. We study the generalization performances of standard classifiers in the high-dimensional regime where $α=n/d$ is kept finite in the limit of a high dimension $d$ and number of samples $n$. Our contribution is three-fold: First, we prove a formula for the generalization error achieved by $\ell_2$ regularized classifiers that minimize a convex loss. This formula was first obtained by the heuristic replica method of statistical physics. Secondly, focussing on commonly used loss functions and optimizing the $\ell_2$ regularization strength, we observe that while ridge regression performance is poor, logistic and hinge regression are surprisingly able to approach the Bayes-optimal generalization error extremely closely. As $α\to \infty$ they lead to Bayes-optimal rates, a fact that does not follow from predictions of margin-based generalization error bounds. Third, we design an optimal loss and regularizer that provably leads to Bayes-optimal generalization error. △ Less

Submitted 7 November, 2020; v1 submitted 11 June, 2020; originally announced June 2020.

Comments: 11 pages + 45 pages Supplementary Material / 5 figures, v2 revised and accepted at NeurIPS

Journal ref: Advances in Neural Information Processing Systems, v33, pages 12199--12210, 2020

arXiv:2006.06098 [pdf, other]

doi 10.1088/1742-5468/ac3a80

Dynamical mean-field theory for stochastic gradient descent in Gaussian mixture classification

Authors: Francesca Mignacco, Florent Krzakala, Pierfrancesco Urbani, Lenka Zdeborová

Abstract: We analyze in a closed form the learning dynamics of stochastic gradient descent (SGD) for a single-layer neural network classifying a high-dimensional Gaussian mixture where each cluster is assigned one of two labels. This problem provides a prototype of a non-convex loss landscape with interpolating regimes and a large generalization gap. We define a particular stochastic process for which SGD c… ▽ More We analyze in a closed form the learning dynamics of stochastic gradient descent (SGD) for a single-layer neural network classifying a high-dimensional Gaussian mixture where each cluster is assigned one of two labels. This problem provides a prototype of a non-convex loss landscape with interpolating regimes and a large generalization gap. We define a particular stochastic process for which SGD can be extended to a continuous-time limit that we call stochastic gradient flow. In the full-batch limit, we recover the standard gradient flow. We apply dynamical mean-field theory from statistical physics to track the dynamics of the algorithm in the high-dimensional limit via a self-consistent stochastic process. We explore the performance of the algorithm as a function of the control parameters shedding light on how it navigates the loss landscape. △ Less

Submitted 9 November, 2021; v1 submitted 10 June, 2020; originally announced June 2020.

Comments: 8 pages + appendix, 4 figures

Journal ref: J. Stat. Mech. 2021 124008 & NeurIPS 2020

Showing 1–50 of 150 results for author: Zdeborova, L