A Phase Transition in Diffusion Models Reveals the Hierarchical Nature of Data

Authors: Antonio Sclocchi, Alessandro Favero, Matthieu Wyart

Abstract: Understanding the structure of real data is paramount in advancing modern deep-learning methodologies. Natural data such as images are believed to be composed of features organised in a hierarchical and combinatorial manner, which neural networks capture during learning. Recent advancements show that diffusion models can generate high-quality images, hinting at their ability to capture this underlying structure. We study this phenomenon in a hierarchical generative model of data. We find that the backward diffusion process acting after a time $t$ is governed by a phase transition at some threshold time, where the probability of reconstructing high-level features, like the class of an image, suddenly drops. Instead, the reconstruction of low-level features, such as specific details of an image, evolves smoothly across the whole diffusion process. This result implies that at times beyond the transition, the class has changed but the generated sample may still be composed of low-level elements of the initial image. We validate these theoretical insights through numerical experiments on class-unconditional ImageNet diffusion models. Our analysis characterises the relationship between time and scale in diffusion models and puts forward generative models as powerful tools to model combinatorial data properties.

Submitted 4 March, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

Comments: 21 pages, 16 figures

arXiv:2309.10688

doi 10.1073/pnas.2316301121

On the different regimes of Stochastic Gradient Descent

Authors: Antonio Sclocchi, Matthieu Wyart

Abstract: Modern deep networks are trained with stochastic gradient descent (SGD) whose key hyperparameters are the number of data considered at each step or batch size $B$, and the step size or learning rate $η$. For small $B$ and large $η$, SGD corresponds to a stochastic evolution of the parameters, whose noise amplitude is governed by the ''temperature'' $T\equiv η/B$. Yet this description is observed to break down for sufficiently large batches $B\geq B^*$, or simplifies to gradient descent (GD) when the temperature is sufficiently small. Understanding where these cross-overs take place remains a central challenge. Here, we resolve these questions for a teacher-student perceptron classification model and show empirically that our key predictions still apply to deep networks. Specifically, we obtain a phase diagram in the $B$-$η$ plane that separates three dynamical phases: (i) a noise-dominated SGD governed by temperature, (ii) a large-first-step-dominated SGD and (iii) GD. These different phases also correspond to different regimes of generalization error. Remarkably, our analysis reveals that the batch size $B^*$ separating regimes (i) and (ii) scale with the size $P$ of the training set, with an exponent that characterizes the hardness of the classification problem.

Submitted 27 February, 2024; v1 submitted 19 September, 2023; originally announced September 2023.

Comments: Main: 8 pages, 4 figures; Appendix: 15 pages, 11 figures

Journal ref: Proceedings of the National Academy of Sciences 121.9 (2024): e2316301121

arXiv:2301.13703

Dissecting the Effects of SGD Noise in Distinct Regimes of Deep Learning

Authors: Antonio Sclocchi, Mario Geiger, Matthieu Wyart

Abstract: Understanding when the noise in stochastic gradient descent (SGD) affects generalization of deep neural networks remains a challenge, complicated by the fact that networks can operate in distinct training regimes. Here we study how the magnitude of this noise $T$ affects performance as the size of the training set $P$ and the scale of initialization $α$ are varied. For gradient descent, $α$ is a key parameter that controls if the network is `lazy' ($α\gg1$) or instead learns features ($α\ll1$). For classification of MNIST and CIFAR10 images, our central results are: (i) obtaining phase diagrams for performance in the $(α,T)$ plane. They show that SGD noise can be detrimental or instead useful depending on the training regime. Moreover, although increasing $T$ or decreasing $α$ both allow the net to escape the lazy regime, these changes can have opposite effects on performance. (ii) Most importantly, we find that the characteristic temperature $T_c$ where the noise of SGD starts affecting the trained model (and eventually performance) is a power law of $P$. We relate this finding with the observation that key dynamical quantities, such as the total variation of weights during training, depend on both $T$ and $P$ as power laws. These results indicate that a key effect of SGD noise occurs late in training by affecting the stopping process whereby all data are fitted. Indeed, we argue that due to SGD noise, nets must develop a stronger `signal', i.e. larger informative weights, to fit the data, leading to a longer training time. A stronger signal and a longer training time are also required when the size of the training set $P$ increases. We confirm these views in the perceptron model, where signal and noise can be precisely measured. Interestingly, exponents characterizing the effect of SGD depend on the density of data near the decision boundary, as we explain.

Submitted 30 May, 2023; v1 submitted 31 January, 2023; originally announced January 2023.

Comments: 25 pages, 21 figures, added analysis in feature-learning

arXiv:2202.03348

Failure and success of the spectral bias prediction for Kernel Ridge Regression: the case of low-dimensional data

Authors: Umberto M. Tomasini, Antonio Sclocchi, Matthieu Wyart

Abstract: Recently, several theories including the replica method made predictions for the generalization error of Kernel Ridge Regression. In some regimes, they predict that the method has a `spectral bias': decomposing the true function $f^*$ on the eigenbasis of the kernel, it fits well the coefficients associated with the O(P) largest eigenvalues, where $P$ is the size of the training set. This prediction works very well on benchmark data sets such as images, yet the assumptions these approaches make on the data are never satisfied in practice. To clarify when the spectral bias prediction holds, we first focus on a one-dimensional model where rigorous results are obtained and then use scaling arguments to generalize and test our findings in higher dimensions. Our predictions include the classification case $f(x)=$sign$(x_1)$ with a data distribution that vanishes at the decision boundary $p(x)\sim x_1^χ$. For $χ>0$ and a Laplace kernel, we find that (i) there exists a cross-over ridge $λ^*_{d,χ}(P)\sim P^{-\frac{1}{d+χ}}$ such that for $λ\gg λ^*_{d,χ}(P)$, the replica method applies, but not for $λ\llλ^*_{d,χ}(P)$, (ii) in the ridge-less case, spectral bias predicts the correct training curve exponent only in the limit $d\rightarrow\infty$.

Submitted 16 February, 2022; v1 submitted 7 February, 2022; originally announced February 2022.

Comments: 34 pages, 11 figures

arXiv:2106.08581

doi 10.1103/PhysRevE.105.024134

High dimensional optimization under non-convex excluded volume constraints

Authors: Antonio Sclocchi, Pierfrancesco Urbani

Abstract: We consider high dimensional random optimization problems where the dynamical variables are subjected to non-convex excluded volume constraints. We focus on the case in which the cost function is a simple quadratic cost and the excluded volume constraints are modeled by a perceptron constraint satisfaction problem. We show that depending on the density of constraints, one can have different situations. If the number of constraints is small, one typically has a phase where the ground state of the cost function is unique and sits on the boundary of the island of configurations allowed by the constraints. In this case, there is a hypostatic number of marginally satisfied constraints. If the number of constraints is increased one enters a glassy phase where the cost function has many local minima sitting again on the boundary of the regions of allowed configurations. At the phase transition point, the total number of marginally satisfied constraints becomes equal to the number of degrees of freedom in the problem and therefore we say that these minima are isostatic. We conjecture that by increasing further the constraints the system stays isostatic up to the point where the volume of available phase space shrinks to zero. We derive our results using the replica method and we also analyze a dynamical algorithm, the Karush-Kuhn-Tucker algorithm, through dynamical mean-field theory and we show how to recover the results of the replica approach in the replica symmetric phase.

Submitted 22 December, 2021; v1 submitted 16 June, 2021; originally announced June 2021.

Comments: 7 pages, 3 figures

arXiv:2010.10253

doi 10.21468/SciPostPhys.10.1.013

Proliferation of non-linear excitations in the piecewise-linear perceptron

Authors: Antonio Sclocchi, Pierfrancesco Urbani

Abstract: We investigate the properties of local minima of the energy landscape of a continuous non-convex optimization problem, the spherical perceptron with piecewise linear cost function and show that they are critical, marginally stable and displaying a set of pseudogaps, singularities and non-linear excitations whose properties appear to be in the same universality class of jammed packings of hard spheres. The piecewise linear perceptron problem appears as an evolution of the purely linear perceptron optimization problem that has been recently investigated in [1]. Its cost function contains two non-analytic points where the derivative has a jump. Correspondingly, in the non-convex/glassy phase, these two points give rise to four pseudogaps in the force distribution and this induces four power laws in the gap distribution as well. In addition one can define an extended notion of isostaticity and show that local minima appear again to be isostatic in this phase. We believe that our results generalize naturally to more complex cases with a proliferation of non-linear excitations as the number of non-analytic points in the cost function is increased.

Submitted 14 December, 2020; v1 submitted 20 October, 2020; originally announced October 2020.

Comments: 14 pages, 7 figures

Journal ref: SciPost Phys. 10, 013 (2021)

arXiv:2010.02158

doi 10.1088/1742-5468/abdc16

Surfing on minima of isostatic landscapes: avalanches and unjamming transition

Authors: Silvio Franz, Antonio Sclocchi, Pierfrancesco Urbani

Abstract: Recently, we showed that optimization problems, both in infinite as well as in finite dimensions, for continuous variables and soft excluded volume constraints, can display entire isostatic phases where local minima of the cost function are marginally stable configurations endowed with non-linear excitations [1,2]. In this work we describe an athermal adiabatic algorithm to explore with continuity the corresponding rough high-dimensional landscape. We concentrate on a prototype problem of this kind, the spherical perceptron optimization problem with linear cost function (hinge loss). This algorithm allows to "surf" between isostatic marginally stable configurations and to investigate some properties of such landscape. In particular we focus on the statistics of avalanches occurring when local minima are destabilized. We show that when perturbing such minima, the system undergoes plastic rearrangements whose size is power law distributed and we characterize the corresponding critical exponent. Finally we investigate the critical properties of the unjamming transition, showing that the linear interaction potential gives rise to logarithmic behavior in the scaling of energy and pressure as a function of the distance from the unjamming point. For some quantities, the logarithmic corrections can be gauged out. This is the case of the number of soft constraints that are violated as a function of the distance from jamming which follows a non-trivial power law behavior.

Submitted 10 December, 2020; v1 submitted 5 October, 2020; originally announced October 2020.

Comments: 22 pages, 13 figures

Journal ref: J. Stat. Mech. (2021) 023208

arXiv:2002.04987

doi 10.21468/SciPostPhys.9.1.012

Critical energy landscape of linear soft spheres

Authors: Silvio Franz, Antonio Sclocchi, Pierfrancesco Urbani

Abstract: We show that soft spheres interacting with a linear ramp potential when overcompressed beyond the jamming point fall in an amorphous solid phase which is critical, mechanically marginally stable and share many features with the jamming point itself. In the whole phase, the relevant local minima of the potential energy landscape display an isostatic contact network of perfectly touching spheres whose statistics is controlled by an infinite lengthscale. Excitations around such energy minima are non-linear, system spanning, and characterized by a set of non-trivial critical exponents. We perform numerical simulations to measure their values and show that, while they coincide, within numerical precision, with the critical exponents appearing at jamming, the nature of the corresponding excitations is richer. Therefore, linear soft spheres appear as a novel class of finite dimensional systems that self-organize into new, critical, marginally stable, states.

Submitted 12 June, 2020; v1 submitted 12 February, 2020; originally announced February 2020.

Comments: 12 pages, 10 figures, Submitted to SciPost Physics

Journal ref: SciPost Phys. 9, 012 (2020)

arXiv:1902.08243

doi 10.1103/PhysRevLett.123.115702

Critical jammed phase of the linear perceptron

Authors: Silvio Franz, Antonio Sclocchi, Pierfrancesco Urbani

Abstract: Criticality in statistical physics naturally emerges at isolated points in the phase diagram. Jamming of spheres is not an exception: varying density, it is the critical point that separates the unjammed phase where spheres do not overlap and the jammed phase where they cannot be arranged without overlaps. The same remains true in more general constraint satisfaction problems with continuous variables (CCSP) where jamming coincides with the (protocol dependent) satisfiability transition point. In this work we show that by carefully choosing the cost function to be minimized, the region of criticality extends to occupy a whole region of the jammed phase. As a working example, we consider the spherical perceptron with a linear cost function in the unsatisfiable (UNSAT) jammed phase and we perform numerical simulations which show critical power laws emerging in the configurations obtained minimizing the linear cost function. We develop a scaling theory to compute the emerging critical exponents.

Submitted 5 August, 2019; v1 submitted 21 February, 2019; originally announced February 2019.

Comments: 10 pages, 4 figures

Journal ref: Phys. Rev. Lett. 123, 115702 (2019)

arXiv:1611.05085

doi 10.1103/PhysRevB.95.054307

Topology of a dissipative spin: dynamical Chern number, bath induced non-adiabaticity and a quantum dynamo effect

Authors: Loic Henriet, Antonio Sclocchi, Peter P. Orth, Karyn Le Hur

Abstract: We analyze the topological deformations of a spin-1/2 in an effective magnetic field induced by an ohmic quantum dissipative environment at zero temperature. From Bethe Ansatz results and a variational approach, we confirm that the Chern number is preserved in the delocalized phase for $α<1$. We report a divergence of the Berry curvature at the equator when $α_c=1$ that appears at the localization Kosterlitz-Thouless quantum phase transition in this model. Recent experiments in quantum circuits have engineered non-equilibrium protocols in time to access topological properties at equilibrium from the measure of the (quasi-)adiabatic out-of-equilibrium spin expectation values. Applying a numerically exact stochastic Schrödinger equation we find that, for a fixed sweep velocity, the bath induces a crossover from (quasi-)adiabatic to non-adiabatic dynamical behavior when the spin bath coupling increases. We also investigate the particular regime $H/ω_c \ll v/H \ll 1$, where the dynamical Chern number observable built from out-of-equilibrium spin expectation values vanishes at $α=1/2$. In this regime, the mapping to an interacting resonance level model enables us to characterize the evolution of the dynamical Chern number in the vicinity of $α=1/2$. Then, we provide an intuitive physical explanation of the breakdown of adiabaticity in analogy to the Faraday effect in electromagnetism. We demonstrate that the driving of the spin leads to the production of a large number of bosonic excitations in the bath, which in return strongly affect the spin dynamics. Finally, we quantify the spin-bath entanglement and build an analogy with an effective model at thermal equilibrium.

Submitted 15 November, 2016; originally announced November 2016.

Journal ref: Phys. Rev. B 95, 054307 (2017)

Showing 1–10 of 10 results for author: Sclocchi, A