Showing 1–11 of 11 results for author: Şimşek, B

Searching in archive cs.
  1. arXiv:2402.18724  [pdf, other]

    cs.LG cs.AI stat.ML

    Learning Associative Memories with Gradient Descent

    Authors: Vivien Cabannes, Berfin Simsek, Alberto Bietti

    Abstract: This work focuses on the training dynamics of one associative memory module storing outer products of token embeddings. We reduce this problem to the study of a system of particles, which interact according to properties of the data distribution and correlations between embeddings. Through theory and experiments, we provide several insights. In overparameterized regimes, we obtain logarithmic grow…

    Submitted 28 February, 2024; originally announced February 2024.

  2. arXiv:2402.05626  [pdf, other]

    cs.LG

    Loss Landscape of Shallow ReLU-like Neural Networks: Stationary Points, Saddle Escaping, and Network Embedding

    Authors: Zhengqing Wu, Berfin Simsek, Francois Ged

    Abstract: In this paper, we investigate the loss landscape of one-hidden-layer neural networks with ReLU-like activation functions trained with the empirical squared loss. As the activation function is non-differentiable, it is so far unclear how to completely characterize the stationary points. We propose the conditions for stationarity that apply to both non-differentiable and differentiable cases. Additi…

    Submitted 11 June, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

  3. arXiv:2311.01644  [pdf, other]

    cs.LG cs.NE stat.ML

    Should Under-parameterized Student Networks Copy or Average Teacher Weights?

    Authors: Berfin Şimşek, Amire Bendjeddou, Wulfram Gerstner, Johanni Brea

    Abstract: Any continuous function $f^*$ can be approximated arbitrarily well by a neural network with sufficiently many neurons $k$. We consider the case when $f^*$ itself is a neural network with one hidden layer and $k$ neurons. Approximating $f^*$ with a neural network with $n< k$ neurons can thus be seen as fitting an under-parameterized "student" network with $n$ neurons to a "teacher" network with…

    Submitted 15 January, 2024; v1 submitted 2 November, 2023; originally announced November 2023.

    Comments: 41 pages, presented at NeurIPS 2023

  4. arXiv:2304.12794  [pdf, other]

    cs.NE

    Expand-and-Cluster: Parameter Recovery of Neural Networks

    Authors: Flavio Martinelli, Berfin Simsek, Wulfram Gerstner, Johanni Brea

    Abstract: Can we identify the weights of a neural network by probing its input-output mapping? At first glance, this problem seems to have many solutions because of permutation, overparameterisation and activation function symmetries. Yet, we show that the incoming weight vector of each neuron is identifiable up to sign or scaling, depending on the activation function. Our novel method 'Expand-and-Cluster'…

    Submitted 27 June, 2024; v1 submitted 25 April, 2023; originally announced April 2023.

    Comments: Accepted paper at ICML '24

  5. arXiv:2301.10638  [pdf, ps, other]

    cs.LG

    MLPGradientFlow: going with the flow of multilayer perceptrons (and finding minima fast and accurately)

    Authors: Johanni Brea, Flavio Martinelli, Berfin Şimşek, Wulfram Gerstner

    Abstract: MLPGradientFlow is a software package to solve numerically the gradient flow differential equation $\dot θ= -\nabla \mathcal L(θ; \mathcal D)$, where $θ$ are the parameters of a multi-layer perceptron, $\mathcal D$ is some data set, and $\nabla \mathcal L$ is the gradient of a loss function. We show numerically that adaptive first- or higher-order integration methods based on Runge-Kutta schemes h…

    Submitted 25 January, 2023; originally announced January 2023.

  6. arXiv:2203.15100  [pdf, other]

    cs.LG cs.CV

    Understanding out-of-distribution accuracies through quantifying difficulty of test samples

    Authors: Berfin Simsek, Melissa Hall, Levent Sagun

    Abstract: Existing works show that although modern neural networks achieve remarkable generalization performance on the in-distribution (ID) dataset, the accuracy drops significantly on the out-of-distribution (OOD) datasets \cite{recht2018cifar, recht2019imagenet}. To understand why a variety of models consistently make more mistakes in the OOD datasets, we propose a new metric to quantify the difficulty o…

    Submitted 28 March, 2022; originally announced March 2022.

    Comments: 18 pages, 15 figures

  7. arXiv:2106.15933  [pdf, other]

    stat.ML cs.LG

    Saddle-to-Saddle Dynamics in Deep Linear Networks: Small Initialization Training, Symmetry, and Sparsity

    Authors: Arthur Jacot, François Ged, Berfin Şimşek, Clément Hongler, Franck Gabriel

    Abstract: The dynamics of Deep Linear Networks (DLNs) is dramatically affected by the variance $σ^2$ of the parameters at initialization $θ_0$. For DLNs of width $w$, we show a phase transition w.r.t. the scaling $γ$ of the variance $σ^2=w^{-γ}$ as $w\to\infty$: for large variance ($γ<1$), $θ_0$ is very close to a global minimum but far from any saddle point, and for small variance ($γ>1$), $θ_0$ is close t…

    Submitted 31 January, 2022; v1 submitted 30 June, 2021; originally announced June 2021.

  8. arXiv:2105.12221  [pdf, other]

    cs.LG

    Geometry of the Loss Landscape in Overparameterized Neural Networks: Symmetries and Invariances

    Authors: Berfin Şimşek, François Ged, Arthur Jacot, Francesco Spadaro, Clément Hongler, Wulfram Gerstner, Johanni Brea

    Abstract: We study how permutation symmetries in overparameterized multi-layer neural networks generate `symmetry-induced' critical points. Assuming a network with $ L $ layers of minimal widths $ r_1^*, \ldots, r_{L-1}^* $ reaches a zero-loss minimum at $ r_1^*! \cdots r_{L-1}^*! $ isolated points that are permutations of one another, we show that adding one extra neuron to each layer is sufficient to conn…

    Submitted 12 September, 2021; v1 submitted 25 May, 2021; originally announced May 2021.

    Comments: 29 pages, 12 figures, ICML 2021

  9. arXiv:2006.09796  [pdf, other]

    stat.ML cs.LG math.PR

    Kernel Alignment Risk Estimator: Risk Prediction from Training Data

    Authors: Arthur Jacot, Berfin Şimşek, Francesco Spadaro, Clément Hongler, Franck Gabriel

    Abstract: We study the risk (i.e. generalization error) of Kernel Ridge Regression (KRR) for a kernel $K$ with ridge $λ>0$ and i.i.d. observations. For this, we introduce two objects: the Signal Capture Threshold (SCT) and the Kernel Alignment Risk Estimator (KARE). The SCT $\vartheta_{K,λ}$ is a function of the data distribution: it can be used to identify the components of the data that the KRR predictor…

    Submitted 17 June, 2020; originally announced June 2020.

  10. arXiv:2002.08404  [pdf, other]

    stat.ML cs.LG

    Implicit Regularization of Random Feature Models

    Authors: Arthur Jacot, Berfin Şimşek, Francesco Spadaro, Clément Hongler, Franck Gabriel

    Abstract: Random Feature (RF) models are used as efficient parametric approximations of kernel methods. We investigate, by means of random matrix theory, the connection between Gaussian RF models and Kernel Ridge Regression (KRR). For a Gaussian RF model with $P$ features, $N$ data points, and a ridge $λ$, we show that the average (i.e. expected) RF predictor is close to a KRR predictor with an effective ri…

    Submitted 23 September, 2020; v1 submitted 19 February, 2020; originally announced February 2020.

    Journal ref: Proceedings of the International Conference on Machine Learning, 2020, pp. 7397-7406

  11. arXiv:1907.02911  [pdf, other]

    cs.LG stat.ML

    Weight-space symmetry in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the loss landscape

    Authors: Johanni Brea, Berfin Simsek, Bernd Illing, Wulfram Gerstner

    Abstract: The permutation symmetry of neurons in each layer of a deep neural network gives rise not only to multiple equivalent global minima of the loss function, but also to first-order saddle points located on the path between the global minima. In a network of $d-1$ hidden layers with $n_k$ neurons in layers $k = 1, \ldots, d$, we construct smooth paths between equivalent global minima that lead through…

    Submitted 5 July, 2019; originally announced July 2019.