Showing 1–9 of 9 results for author: Nacson, M S

  1. arXiv:2402.06323  [pdf, other]

    cs.LG stat.ML

    How Uniform Random Weights Induce Non-uniform Bias: Typical Interpolating Neural Networks Generalize with Narrow Teachers

    Authors: Gon Buzaglo, Itamar Harel, Mor Shpigel Nacson, Alon Brutzkus, Nathan Srebro, Daniel Soudry

    Abstract: Background. A main theoretical puzzle is why over-parameterized Neural Networks (NNs) generalize well when trained to zero loss (i.e., so they interpolate the data). Usually, the NN is trained with Stochastic Gradient Descent (SGD) or one of its variants. However, recent empirical work examined the generalization of a random NN that interpolates the data: the NN was sampled from a seemingly unifor…

    Submitted 9 June, 2024; v1 submitted 9 February, 2024; originally announced February 2024.

  2. arXiv:2306.17499  [pdf, other]

    cs.LG

    The Implicit Bias of Minima Stability in Multivariate Shallow ReLU Networks

    Authors: Mor Shpigel Nacson, Rotem Mulayoff, Greg Ongie, Tomer Michaeli, Daniel Soudry

    Abstract: We study the type of solutions to which stochastic gradient descent converges when used to train a single hidden-layer multivariate ReLU network with the quadratic loss. Our results are based on a dynamical stability analysis. In the univariate case, it was shown that linearly stable minima correspond to network functions (predictors), whose second derivative has a bounded weighted $L^1$ norm. Not…

    Submitted 30 June, 2023; originally announced June 2023.

    Comments: Published at ICLR 2023. Fixed statements and proofs of Proposition 3 and Theorem 2

  3. arXiv:2305.13064  [pdf, other]

    cs.LG math.OC stat.ML

    Gradient Descent Monotonically Decreases the Sharpness of Gradient Flow Solutions in Scalar Networks and Beyond

    Authors: Itai Kreisler, Mor Shpigel Nacson, Daniel Soudry, Yair Carmon

    Abstract: Recent research shows that when Gradient Descent (GD) is applied to neural networks, the loss almost never decreases monotonically. Instead, the loss oscillates as gradient descent converges to its "Edge of Stability" (EoS). Here, we find a quantity that does decrease monotonically throughout GD training: the sharpness attained by the gradient flow solution (GFS), the solution that would be obtai…

    Submitted 22 May, 2023; originally announced May 2023.

  4. arXiv:2102.09769  [pdf, other]

    cs.LG

    On the Implicit Bias of Initialization Shape: Beyond Infinitesimal Mirror Descent

    Authors: Shahar Azulay, Edward Moroshko, Mor Shpigel Nacson, Blake Woodworth, Nathan Srebro, Amir Globerson, Daniel Soudry

    Abstract: Recent work has highlighted the role of initialization scale in determining the structure of the solutions that gradient methods converge to. In particular, it was shown that large initialization leads to the neural tangent kernel regime solution, whereas small initialization leads to so-called "rich regimes". However, the initialization structure is richer than the overall scale alone and involve…

    Submitted 19 February, 2021; originally announced February 2021.

    Comments: 33 pages, 2 figures

    MSC Class: 68T07 (Primary) ACM Class: I.2.6; G.1.6

  5. arXiv:1909.12340  [pdf, other]

    cs.LG stat.ML

    At Stability's Edge: How to Adjust Hyperparameters to Preserve Minima Selection in Asynchronous Training of Neural Networks?

    Authors: Niv Giladi, Mor Shpigel Nacson, Elad Hoffer, Daniel Soudry

    Abstract: Background: Recent developments have made it possible to accelerate neural network training significantly using large batch sizes and data parallelism. Training in an asynchronous fashion, where delay occurs, can make training even more scalable. However, asynchronous training has its pitfalls, mainly a degradation in generalization, even after convergence of the algorithm. This gap remains not w…

    Submitted 13 February, 2020; v1 submitted 26 September, 2019; originally announced September 2019.

    Comments: ICLR 2020 Camera ready version

  6. arXiv:1905.07325  [pdf, ps, other]

    stat.ML cs.LG

    Lexicographic and Depth-Sensitive Margins in Homogeneous and Non-Homogeneous Deep Models

    Authors: Mor Shpigel Nacson, Suriya Gunasekar, Jason D. Lee, Nathan Srebro, Daniel Soudry

    Abstract: With an eye toward understanding complexity control in deep learning, we study how infinitesimal regularization or gradient descent optimization leads to margin-maximizing solutions in both homogeneous and non-homogeneous models, extending previous work that focused on infinitesimal regularization only in homogeneous models. To this end we study the limit of loss minimization with a diverging norm…

    Submitted 17 May, 2019; originally announced May 2019.

    Comments: ICML Camera ready version

  7. arXiv:1806.01796  [pdf, other]

    stat.ML cs.LG

    Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate

    Authors: Mor Shpigel Nacson, Nathan Srebro, Daniel Soudry

    Abstract: Stochastic Gradient Descent (SGD) is a central tool in machine learning. We prove that SGD converges to zero loss, even with a fixed (non-vanishing) learning rate, in the special case of homogeneous linear classifiers with smooth monotone loss functions, optimized on linearly separable data. Previous works assumed either a vanishing learning rate, iterate averaging, or loss assumptions that do no…

    Submitted 18 April, 2022; v1 submitted 5 June, 2018; originally announced June 2018.

    Comments: Fixed a typo (Eq. (4) - missing σ_{max}^2 term in the denominator)

  8. arXiv:1803.01905  [pdf, other]

    stat.ML cs.LG

    Convergence of Gradient Descent on Separable Data

    Authors: Mor Shpigel Nacson, Jason D. Lee, Suriya Gunasekar, Pedro H. P. Savarese, Nathan Srebro, Daniel Soudry

    Abstract: We provide a detailed study on the implicit bias of gradient descent when optimizing loss functions with strictly monotone tails, such as the logistic loss, over separable datasets. We look at two basic questions: (a) what are the conditions on the tail of the loss function under which gradient descent converges in the direction of the $L_2$ maximum-margin separator? (b) how does the rate of margi…

    Submitted 24 March, 2019; v1 submitted 5 March, 2018; originally announced March 2018.

    Comments: AISTATS Camera ready version

  9. arXiv:1710.10345  [pdf, ps, other]

    stat.ML cs.LG

    The Implicit Bias of Gradient Descent on Separable Data

    Authors: Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, Nathan Srebro

    Abstract: We examine gradient descent on unregularized logistic regression problems, with homogeneous linear predictors on linearly separable datasets. We show the predictor converges to the direction of the max-margin (hard margin SVM) solution. The result also generalizes to other monotone decreasing loss functions with an infimum at infinity, to multi-class problems, and to training a weight layer in a d…

    Submitted 16 April, 2024; v1 submitted 27 October, 2017; originally announced October 2017.

    Comments: Change from v5: clarified the derivation between eqs. (41) and (42)