
Showing 1–7 of 7 results for author: Sander, M E

  1. arXiv:2402.05787  [pdf, other]

    stat.ML cs.LG

    How do Transformers perform In-Context Autoregressive Learning?

    Authors: Michael E. Sander, Raja Giryes, Taiji Suzuki, Mathieu Blondel, Gabriel Peyré

    Abstract: Transformers have achieved state-of-the-art performance in language modeling tasks. However, the reasons behind their tremendous success are still unclear. In this paper, towards a better understanding, we train a Transformer model on a simple next token prediction task, where sequences are generated as a first-order autoregressive process $s_{t+1} = W s_t$. We show how a trained Transformer predi…

    Submitted 5 June, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

    Comments: 20 pages ICML 2024

  2. arXiv:2309.01213  [pdf, other]

    stat.ML cs.LG

    Implicit regularization of deep residual networks towards neural ODEs

    Authors: Pierre Marion, Yu-Han Wu, Michael E. Sander, Gérard Biau

    Abstract: Residual neural networks are state-of-the-art deep learning models. Their continuous-depth analog, neural ordinary differential equations (ODEs), are also widely used. Despite their success, the link between the discrete and continuous models still lacks a solid mathematical foundation. In this article, we take a step in this direction by establishing an implicit regularization of deep residual ne…

    Submitted 1 March, 2024; v1 submitted 3 September, 2023; originally announced September 2023.

    Comments: ICLR 2024 (spotlight). 40 pages, 3 figures

  3. arXiv:2302.01425  [pdf, other]

    cs.LG stat.ML

    Fast, Differentiable and Sparse Top-k: a Convex Analysis Perspective

    Authors: Michael E. Sander, Joan Puigcerver, Josip Djolonga, Gabriel Peyré, Mathieu Blondel

    Abstract: The top-k operator returns a sparse vector, where the non-zero values correspond to the k largest values of the input. Unfortunately, because it is a discontinuous function, it is difficult to incorporate in neural networks trained end-to-end with backpropagation. Recent works have considered differentiable relaxations, based either on regularization or perturbation techniques. However, to date, n…

    Submitted 4 June, 2023; v1 submitted 2 February, 2023; originally announced February 2023.

    Comments: ICML 2023 18 pages

  4. arXiv:2210.09221  [pdf, other]

    cs.CV cs.LG

    Vision Transformers provably learn spatial structure

    Authors: Samy Jelassi, Michael E. Sander, Yuanzhi Li

    Abstract: Vision Transformers (ViTs) have achieved comparable or superior performance than Convolutional Neural Networks (CNNs) in computer vision. This empirical breakthrough is even more remarkable since, in contrast to CNNs, ViTs do not embed any visual inductive bias of spatial locality. Yet, recent works have shown that while minimizing their training loss, ViTs specifically learn spatially localized p…

    Submitted 13 October, 2022; originally announced October 2022.

  5. arXiv:2205.14612  [pdf, other]

    cs.LG stat.ML

    Do Residual Neural Networks discretize Neural Ordinary Differential Equations?

    Authors: Michael E. Sander, Pierre Ablin, Gabriel Peyré

    Abstract: Neural Ordinary Differential Equations (Neural ODEs) are the continuous analog of Residual Neural Networks (ResNets). We investigate whether the discrete dynamics defined by a ResNet are close to the continuous one of a Neural ODE. We first quantify the distance between the ResNet's hidden state trajectory and the solution of its corresponding Neural ODE. Our bound is tight and, on the negative si…

    Submitted 15 September, 2022; v1 submitted 29 May, 2022; originally announced May 2022.

    Comments: Accepted at NeurIPS 2022 24 pages

  6. arXiv:2110.11773  [pdf, other]

    cs.LG stat.ML

    Sinkformers: Transformers with Doubly Stochastic Attention

    Authors: Michael E. Sander, Pierre Ablin, Mathieu Blondel, Gabriel Peyré

    Abstract: Attention based models such as Transformers involve pairwise interactions between data points, modeled with a learnable attention matrix. Importantly, this attention matrix is normalized with the SoftMax operator, which makes it row-wise stochastic. In this paper, we propose instead to use Sinkhorn's algorithm to make attention matrices doubly stochastic. We call the resulting model a Sinkformer.…

    Submitted 24 January, 2022; v1 submitted 22 October, 2021; originally announced October 2021.

    Comments: Accepted at AISTATS

  7. arXiv:2102.07870  [pdf, other]

    cs.LG cs.AI stat.ML

    Momentum Residual Neural Networks

    Authors: Michael E. Sander, Pierre Ablin, Mathieu Blondel, Gabriel Peyré

    Abstract: The training of deep residual neural networks (ResNets) with backpropagation has a memory cost that increases linearly with respect to the depth of the network. A way to circumvent this issue is to use reversible architectures. In this paper, we propose to change the forward rule of a ResNet by adding a momentum term. The resulting networks, momentum residual neural networks (Momentum ResNets), ar…

    Submitted 22 July, 2021; v1 submitted 15 February, 2021; originally announced February 2021.

    Comments: 24 pages