Showing 1–15 of 15 results for author: von Oswald, J

  1. arXiv:2406.08423  [pdf, other]

    cs.LG cs.AI

    State Soup: In-Context Skill Learning, Retrieval and Mixing

    Authors: Maciej Pióro, Maciej Wołczyk, Razvan Pascanu, Johannes von Oswald, João Sacramento

    Abstract: A new breed of gated-linear recurrent neural networks has reached state-of-the-art performance on a range of sequence modeling problems. Such models naturally handle long sequences efficiently, as the cost of processing a new input is independent of sequence length. Here, we explore another advantage of these stateful sequence models, inspired by the success of model merging through parameter inte…

    Submitted 12 June, 2024; originally announced June 2024.

  2. arXiv:2402.14180  [pdf, other]

    cs.LG

    Linear Transformers are Versatile In-Context Learners

    Authors: Max Vladymyrov, Johannes von Oswald, Mark Sandler, Rong Ge

    Abstract: Recent research has demonstrated that transformers, particularly linear attention models, implicitly execute gradient-descent-like algorithms on data provided in-context during their forward inference step. However, their capability in handling more complex problems remains unexplored. In this paper, we prove that any linear transformer maintains an implicit linear model and can be interpreted as…

    Submitted 21 February, 2024; originally announced February 2024.

  3. arXiv:2312.15001  [pdf, other]

    cs.LG cs.NE

    Discovering modular solutions that generalize compositionally

    Authors: Simon Schug, Seijin Kobayashi, Yassir Akram, Maciej Wołczyk, Alexandra Proca, Johannes von Oswald, Razvan Pascanu, João Sacramento, Angelika Steger

    Abstract: Many complex tasks can be decomposed into simpler, independent parts. Discovering such underlying compositional structure has the potential to enable compositional generalization. Despite progress, our most powerful systems struggle to compose flexibly. It therefore seems natural to make models more modular to help capture the compositional nature of many tasks. However, it is unclear under which…

    Submitted 25 March, 2024; v1 submitted 22 December, 2023; originally announced December 2023.

    Comments: Published as a conference paper at ICLR 2024; Code available at https://github.com/smonsays/modular-hyperteacher

  4. arXiv:2309.05858  [pdf, other]

    cs.LG cs.AI

    Uncovering mesa-optimization algorithms in Transformers

    Authors: Johannes von Oswald, Eyvind Niklasson, Maximilian Schlegel, Seijin Kobayashi, Nicolas Zucchet, Nino Scherrer, Nolan Miller, Mark Sandler, Blaise Agüera y Arcas, Max Vladymyrov, Razvan Pascanu, João Sacramento

    Abstract: Transformers have become the dominant model in deep learning, but the reason for their superior performance is poorly understood. Here, we hypothesize that the strong performance of Transformers stems from an architectural bias towards mesa-optimization, a learned process running within the forward pass of a model consisting of the following two steps: (i) the construction of an internal learning…

    Submitted 11 September, 2023; originally announced September 2023.

  5. arXiv:2309.01775  [pdf, other]

    cs.LG cs.NE

    Gated recurrent neural networks discover attention

    Authors: Nicolas Zucchet, Seijin Kobayashi, Yassir Akram, Johannes von Oswald, Maxime Larcher, Angelika Steger, João Sacramento

    Abstract: Recent architectural developments have enabled recurrent neural networks (RNNs) to reach and even surpass the performance of Transformers on certain sequence modeling tasks. These modern RNNs feature a prominent design pattern: linear recurrent layers interconnected by feedforward paths with multiplicative gating. Here, we show how RNNs equipped with these two design elements can exactly implement…

    Submitted 7 February, 2024; v1 submitted 4 September, 2023; originally announced September 2023.

  6. arXiv:2212.07677  [pdf, other]

    cs.LG cs.AI cs.CL

    Transformers learn in-context by gradient descent

    Authors: Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, Max Vladymyrov

    Abstract: At present, the mechanisms of in-context learning in Transformers are not well understood and remain mostly an intuition. In this paper, we suggest that training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations. We start by providing a simple weight construction that shows the equivalence of data transformations induced by 1) a single linea…

    Submitted 31 May, 2023; v1 submitted 15 December, 2022; originally announced December 2022.

  7. arXiv:2210.09818  [pdf, other]

    cs.LG

    Disentangling the Predictive Variance of Deep Ensembles through the Neural Tangent Kernel

    Authors: Seijin Kobayashi, Pau Vilimelis Aceituno, Johannes von Oswald

    Abstract: Identifying unfamiliar inputs, also known as out-of-distribution (OOD) detection, is a crucial property of any decision making process. A simple and empirically validated technique is based on deep ensembles where the variance of predictions over different neural networks acts as a substitute for input uncertainty. Nevertheless, a theoretical understanding of the inductive biases leading to the pe…

    Submitted 18 October, 2022; originally announced October 2022.

  8. arXiv:2209.07509  [pdf, other]

    cs.LG

    Random initialisations performing above chance and how to find them

    Authors: Frederik Benzing, Simon Schug, Robert Meier, Johannes von Oswald, Yassir Akram, Nicolas Zucchet, Laurence Aitchison, Angelika Steger

    Abstract: Neural networks trained with stochastic gradient descent (SGD) starting from different random initialisations typically find functionally very similar solutions, raising the question of whether there are meaningful differences between different SGD solutions. Entezari et al. recently conjectured that despite different initialisations, the solutions found by SGD lie in the same loss valley after t…

    Submitted 7 November, 2022; v1 submitted 15 September, 2022; originally announced September 2022.

    Comments: NeurIPS 2022, 14th Annual Workshop on Optimization for Machine Learning (OPT2022)

  9. arXiv:2207.01332  [pdf, other]

    cs.LG cs.NE

    The least-control principle for local learning at equilibrium

    Authors: Alexander Meulemans, Nicolas Zucchet, Seijin Kobayashi, Johannes von Oswald, João Sacramento

    Abstract: Equilibrium systems are a powerful way to express neural computations. As special cases, they include models of great current interest in both neuroscience and machine learning, such as deep neural networks, equilibrium recurrent neural networks, deep equilibrium models, or meta-learning. Here, we present a new principle for learning such systems with a temporally- and spatially-local rule. Our pr…

    Submitted 31 October, 2022; v1 submitted 4 July, 2022; originally announced July 2022.

    Comments: Published at NeurIPS 2022. 56 pages

    MSC Class: 68T07; ACM Class: I.2.6

  10. arXiv:2110.14402  [pdf, other]

    cs.LG cs.NE

    Learning where to learn: Gradient sparsity in meta and continual learning

    Authors: Johannes von Oswald, Dominic Zhao, Seijin Kobayashi, Simon Schug, Massimo Caccia, Nicolas Zucchet, João Sacramento

    Abstract: Finding neural network weights that generalize well from small datasets is difficult. A promising approach is to learn a weight initialization such that a small number of weight changes results in low generalization error. We show that this form of meta-learning can be improved by letting the learning algorithm decide which weights to change, i.e., by learning where to learn. We find that patterne…

    Submitted 27 October, 2021; originally announced October 2021.

    Comments: Published at NeurIPS 2021

  11. arXiv:2104.01677  [pdf, other]

    cs.LG cs.NE q-bio.NC

    A contrastive rule for meta-learning

    Authors: Nicolas Zucchet, Simon Schug, Johannes von Oswald, Dominic Zhao, João Sacramento

    Abstract: Humans and other animals are capable of improving their learning performance as they solve related tasks from a given problem domain, to the point of being able to learn from extremely limited data. While synaptic plasticity is generically thought to underlie learning in the brain, the precise neural and synaptic mechanisms by which learning processes improve through experience are not well unders…

    Submitted 3 October, 2022; v1 submitted 4 April, 2021; originally announced April 2021.

    Comments: 32 pages, 10 figures, published at NeurIPS 2022

  12. arXiv:2103.01133  [pdf, other]

    cs.LG cs.AI

    Posterior Meta-Replay for Continual Learning

    Authors: Christian Henning, Maria R. Cervera, Francesco D'Angelo, Johannes von Oswald, Regina Traber, Benjamin Ehret, Seijin Kobayashi, Benjamin F. Grewe, João Sacramento

    Abstract: Learning a sequence of tasks without access to i.i.d. observations is a widely studied form of continual learning (CL) that remains challenging. In principle, Bayesian learning directly applies to this setting, since recursive and one-off Bayesian updates yield the same result. In practice, however, recursive updating often leads to poor trade-off solutions across tasks because approximate inferen…

    Submitted 21 October, 2021; v1 submitted 1 March, 2021; originally announced March 2021.

    Comments: Published at NeurIPS 2021

  13. arXiv:2007.12927  [pdf, other]

    cs.LG cs.CV stat.ML

    Neural networks with late-phase weights

    Authors: Johannes von Oswald, Seijin Kobayashi, Alexander Meulemans, Christian Henning, Benjamin F. Grewe, João Sacramento

    Abstract: The largely successful method of training neural networks is to learn their weights using some variant of stochastic gradient descent (SGD). Here, we show that the solutions found by SGD can be further improved by ensembling a subset of the weights in late stages of learning. At the end of learning, we obtain back a single model by taking a spatial average in weight space. To avoid incurring incre…

    Submitted 11 April, 2022; v1 submitted 25 July, 2020; originally announced July 2020.

    Comments: 25 pages, 6 figures

    Journal ref: Published as a conference paper at ICLR 2021

  14. arXiv:2006.12109  [pdf, other]

    cs.LG stat.ML

    Continual Learning in Recurrent Neural Networks

    Authors: Benjamin Ehret, Christian Henning, Maria R. Cervera, Alexander Meulemans, Johannes von Oswald, Benjamin F. Grewe

    Abstract: While a diverse collection of continual learning (CL) methods has been proposed to prevent catastrophic forgetting, a thorough investigation of their effectiveness for processing sequential data with recurrent neural networks (RNNs) is lacking. Here, we provide the first comprehensive evaluation of established CL methods on a variety of sequential data benchmarks. Specifically, we shed light on th…

    Submitted 10 March, 2021; v1 submitted 22 June, 2020; originally announced June 2020.

    Comments: Published at ICLR 2021

  15. arXiv:1906.00695  [pdf, other]

    cs.LG cs.AI stat.ML

    Continual learning with hypernetworks

    Authors: Johannes von Oswald, Christian Henning, Benjamin F. Grewe, João Sacramento

    Abstract: Artificial neural networks suffer from catastrophic forgetting when they are sequentially trained on multiple tasks. To overcome this problem, we present a novel approach based on task-conditioned hypernetworks, i.e., networks that generate the weights of a target model based on task identity. Continual learning (CL) is less difficult for this class of models thanks to a simple key feature: instea…

    Submitted 11 April, 2022; v1 submitted 3 June, 2019; originally announced June 2019.

    Comments: Published at ICLR 2020

    MSC Class: 68T99