Search | arXiv e-print repository

Implicit Bias of Mirror Flow on Separable Data

Authors: Scott Pesme, Radu-Alexandru Dragomir, Nicolas Flammarion

Abstract: We examine the continuous-time counterpart of mirror descent, namely mirror flow, on classification problems which are linearly separable. Such problems are minimised `at infinity' and have many possible solutions; we study which solution is preferred by the algorithm depending on the mirror potential. For exponential tailed losses and under mild assumptions on the potential, we show that the iterates converge in direction towards a $φ_\infty$-maximum margin classifier. The function $φ_\infty$ is the $\textit{horizon function}$ of the mirror potential and characterises its shape `at infinity'. When the potential is separable, a simple formula allows to compute this function. We analyse several examples of potentials and provide numerical experiments highlighting our results.

Submitted 19 June, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

Comments: Exact same text as first version but the acknowledgments section is updated

arXiv:2403.05293 [pdf, other]

Leveraging Continuous Time to Understand Momentum When Training Diagonal Linear Networks

Authors: Hristo Papazov, Scott Pesme, Nicolas Flammarion

Abstract: In this work, we investigate the effect of momentum on the optimisation trajectory of gradient descent. We leverage a continuous-time approach in the analysis of momentum gradient descent with step size $γ$ and momentum parameter $β$ that allows us to identify an intrinsic quantity $λ= \frac{ γ}{ (1 - β)^2 }$ which uniquely defines the optimisation path and provides a simple acceleration rule. When training a $2$-layer diagonal linear network in an overparametrised regression setting, we characterise the recovered solution through an implicit regularisation problem. We then prove that small values of $λ$ help to recover sparse solutions. Finally, we give similar but weaker results for stochastic momentum gradient descent. We provide numerical experiments which support our claims.

Submitted 8 March, 2024; originally announced March 2024.
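
A minimal sketch of the intrinsic quantity $λ = \frac{γ}{(1 - β)^2}$ named in the abstract above. The function names and the restriction of $β$ to $[0, 1)$ are assumptions made for illustration and are not taken from the paper. Since the abstract states that $λ$ defines the optimisation path, the second helper simply solves the same formula for the step size $γ$ that keeps $λ$ fixed when $β$ changes.

```python
def intrinsic_lambda(gamma: float, beta: float) -> float:
    """Return lambda = gamma / (1 - beta)^2 for step size gamma and momentum beta."""
    # Assumption for this sketch: beta lies in [0, 1), so the denominator is positive.
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    return gamma / (1.0 - beta) ** 2


def step_size_for(lam: float, beta: float) -> float:
    """Solve the same formula for gamma, given a target lambda and a momentum beta."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    return lam * (1.0 - beta) ** 2


if __name__ == "__main__":
    # Hypothetical values: keep lambda fixed while varying the momentum parameter.
    target = 0.5
    for beta in (0.0, 0.5, 0.9):
        gamma = step_size_for(target, beta)
        print(f"beta={beta:.1f}  gamma={gamma:.4f}  lambda={intrinsic_lambda(gamma, beta):.4f}")
```

With $β = 0$ the formula reduces to $λ = γ$, the plain gradient descent step size.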

arXiv:2304.00488 [pdf, other]

Saddle-to-Saddle Dynamics in Diagonal Linear Networks

Authors: Scott Pesme, Nicolas Flammarion

Abstract: In this paper we fully describe the trajectory of gradient flow over diagonal linear networks in the limit of vanishing initialisation. We show that the limiting flow successively jumps from a saddle of the training loss to another until reaching the minimum $\ell_1$-norm solution. This saddle-to-saddle dynamics translates to an incremental learning process as each saddle corresponds to the minimiser of the loss constrained to an active set outside of which the coordinates must be zero. We explicitly characterise the visited saddles as well as the jumping times through a recursive algorithm reminiscent of the LARS algorithm used for computing the Lasso path. Our proof leverages a convenient arc-length time-reparametrisation which enables to keep track of the heteroclinic transitions between the jumps. Our analysis requires negligible assumptions on the data, applies to both under and overparametrised settings and covers complex cases where there is no monotonicity of the number of active coordinates. We provide numerical experiments to support our findings.

Submitted 25 October, 2023; v1 submitted 2 April, 2023; originally announced April 2023.

arXiv:2302.08982 [pdf, other]

(S)GD over Diagonal Linear Networks: Implicit Regularisation, Large Stepsizes and Edge of Stability

Authors: Mathieu Even, Scott Pesme, Suriya Gunasekar, Nicolas Flammarion

Abstract: In this paper, we investigate the impact of stochasticity and large stepsizes on the implicit regularisation of gradient descent (GD) and stochastic gradient descent (SGD) over diagonal linear networks. We prove the convergence of GD and SGD with macroscopic stepsizes in an overparametrised regression setting and characterise their solutions through an implicit regularisation problem. Our crisp characterisation leads to qualitative insights about the impact of stochasticity and stepsizes on the recovered solution. Specifically, we show that large stepsizes consistently benefit SGD for sparse regression problems, while they can hinder the recovery of sparse solutions for GD. These effects are magnified for stepsizes in a tight window just below the divergence threshold, in the "edge of stability" regime. Our findings are supported by experimental results.

Submitted 25 October, 2023; v1 submitted 17 February, 2023; originally announced February 2023.

arXiv:2106.09524 [pdf, other]

Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of Stochasticity

Authors: Scott Pesme, Loucas Pillaud-Vivien, Nicolas Flammarion

Abstract: Understanding the implicit bias of training algorithms is of crucial importance in order to explain the success of overparametrised neural networks. In this paper, we study the dynamics of stochastic gradient descent over diagonal linear networks through its continuous time version, namely stochastic gradient flow. We explicitly characterise the solution chosen by the stochastic flow and prove that it always enjoys better generalisation properties than that of gradient flow. Quite surprisingly, we show that the convergence speed of the training loss controls the magnitude of the biasing effect: the slower the convergence, the better the bias. To fully complete our analysis, we provide convergence guarantees for the dynamics. We also give experimental results which support our theoretical claims. Our findings highlight the fact that structured noise can induce better generalisation and they help explain the greater performances observed in practice of stochastic gradient descent over gradient descent.

Submitted 7 December, 2021; v1 submitted 17 June, 2021; originally announced June 2021.

arXiv:2007.00534 [pdf, ps, other]

On Convergence-Diagnostic based Step Sizes for Stochastic Gradient Descent

Authors: Scott Pesme, Aymeric Dieuleveut, Nicolas Flammarion

Abstract: Constant step-size Stochastic Gradient Descent exhibits two phases: a transient phase during which iterates make fast progress towards the optimum, followed by a stationary phase during which iterates oscillate around the optimal point. In this paper, we show that efficiently detecting this transition and appropriately decreasing the step size can lead to fast convergence rates. We analyse the classical statistical test proposed by Pflug (1983), based on the inner product between consecutive stochastic gradients. Even in the simple case where the objective function is quadratic we show that this test cannot lead to an adequate convergence diagnostic. We then propose a novel and simple statistical procedure that accurately detects stationarity and we provide experimental results showing state-of-the-art performance on synthetic and real-world datasets.

Submitted 1 July, 2020; originally announced July 2020.
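
A rough sketch of the kind of diagnostic the abstract above attributes to Pflug (1983): a running sum of inner products between consecutive stochastic gradients, with the step size decreased once the sum turns negative. The noisy quadratic objective, the burn-in length, the shrink factor, and all names below are assumptions chosen for illustration; this is neither the paper's experimental setup nor its proposed procedure, and the abstract itself argues that this test is not an adequate diagnostic.

```python
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def sgd_with_inner_product_test(dim=5, n_steps=5000, step=0.1,
                                burn_in=50, shrink=0.5, noise=1.0, seed=0):
    """Constant-step SGD on 0.5 * ||theta||^2 with additive Gaussian gradient noise.

    After `burn_in` iterations since the last restart, the step size is multiplied
    by `shrink` whenever the running sum of inner products of consecutive
    stochastic gradients is negative. All parameters are hypothetical.
    """
    rng = random.Random(seed)
    theta = [5.0] * dim            # start away from the optimum at the origin
    prev_grad = None
    running_sum = 0.0
    since_restart = 0
    steps_used = [step]
    for _ in range(n_steps):
        # Stochastic gradient: exact gradient theta plus Gaussian noise.
        grad = [t + noise * rng.gauss(0.0, 1.0) for t in theta]
        if prev_grad is not None:
            running_sum += dot(grad, prev_grad)
        theta = [t - step * g for t, g in zip(theta, grad)]
        prev_grad = grad
        since_restart += 1
        if since_restart > burn_in and running_sum < 0.0:
            # Test fires: declare stationarity, shrink the step, reset the statistic.
            step *= shrink
            steps_used.append(step)
            running_sum = 0.0
            since_restart = 0
            prev_grad = None
    return theta, steps_used


if __name__ == "__main__":
    theta, steps_used = sgd_with_inner_product_test()
    print("number of step-size decreases:", len(steps_used) - 1)
    print("final step size:", steps_used[-1])
```

The intuition behind the statistic is that consecutive gradients tend to point in similar directions during the transient phase and in opposing directions once the iterates oscillate around the optimum.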

arXiv:2007.00399 [pdf, ps, other]

Online Robust Regression via SGD on the l1 loss

Authors: Scott Pesme, Nicolas Flammarion

Abstract: We consider the robust linear regression problem in the online setting where we have access to the data in a streaming manner, one data point after the other. More specifically, for a true parameter $θ^*$, we consider the corrupted Gaussian linear model $y = \langle x , \ θ^* \rangle + \varepsilon + b$ where the adversarial noise $b$ can take any value with probability $η$ and equals zero otherwise. We consider this adversary to be oblivious (i.e., $b$ independent of the data) since this is the only contamination model under which consistency is possible. Current algorithms rely on having the whole data at hand in order to identify and remove the outliers. In contrast, we show in this work that stochastic gradient descent on the $\ell_1$ loss converges to the true parameter vector at a $\tilde{O}( 1 / (1 - η)^2 n )$ rate which is independent of the values of the contaminated measurements. Our proof relies on the elegant smoothing of the non-smooth $\ell_1$ loss by the Gaussian data and a classical non-asymptotic analysis of Polyak-Ruppert averaged SGD. In addition, we provide experimental evidence of the efficiency of this simple and highly scalable algorithm.

Submitted 1 July, 2020; originally announced July 2020.
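
A minimal sketch of the setting described in the abstract above: streaming samples from the corrupted model $y = \langle x, θ^* \rangle + \varepsilon + b$, a subgradient step on the $\ell_1$ loss $|y - \langle x, θ \rangle|$, and Polyak-Ruppert averaging of the iterates. The step-size schedule `step0 / sqrt(t)`, the corruption magnitude, and every name and default value are assumptions made for illustration, not the schedule or experiments from the paper.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def sign(r):
    """Sign of r as -1, 0 or 1 (a valid subgradient of |r|)."""
    return (r > 0) - (r < 0)


def averaged_l1_sgd(theta_star, n=20000, eta=0.2, sigma=0.1, step0=1.0, seed=0):
    """Averaged SGD on the l1 loss over a simulated corrupted stream.

    eta is the corruption probability; corrupted samples receive a large
    offset b drawn independently of x (an oblivious adversary).
    """
    rng = random.Random(seed)
    d = len(theta_star)
    theta = [0.0] * d
    avg = [0.0] * d
    for t in range(1, n + 1):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Hypothetical corruption: large Gaussian offset with probability eta.
        b = 1000.0 * rng.gauss(0.0, 1.0) if rng.random() < eta else 0.0
        y = dot(x, theta_star) + sigma * rng.gauss(0.0, 1.0) + b
        # Subgradient of |y - <x, theta>| with respect to theta is -sign(residual) * x.
        s = sign(y - dot(x, theta))
        step = step0 / math.sqrt(t)   # assumed decreasing schedule
        theta = [th + step * s * xi for th, xi in zip(theta, x)]
        # Running Polyak-Ruppert average of the iterates.
        avg = [a + (th - a) / t for a, th in zip(avg, theta)]
    return avg


if __name__ == "__main__":
    theta_star = [1.0, -2.0, 0.5]
    estimate = averaged_l1_sgd(theta_star)
    print("estimate:", [round(v, 3) for v in estimate])
```

Because the update only uses the sign of the residual, the magnitude of a corrupted measurement never enters the step, which is consistent with the abstract's remark that the rate does not depend on the values of the contaminated measurements.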

Showing 1–7 of 7 results for author: Pesme, S