Showing 1–16 of 16 results for author: Schlag, I

  1. arXiv:2405.19279  [pdf, other]

    cs.LG

    Understanding and Minimising Outlier Features in Neural Network Training

    Authors: Bobby He, Lorenzo Noci, Daniele Paliotta, Imanol Schlag, Thomas Hofmann

    Abstract: Outlier Features (OF) are neurons whose activation magnitudes significantly exceed the average over a neural network's (NN) width. They are well known to emerge during standard transformer training and have the undesirable effect of hindering quantisation in afflicted models. Despite their practical importance, little is known behind why OFs emerge during training, nor how one can minimise them.…

    Submitted 29 May, 2024; originally announced May 2024.

  2. arXiv:2404.07982  [pdf, other]

    cs.CL cs.LG

    Language Imbalance Can Boost Cross-lingual Generalisation

    Authors: Anton Schäfer, Shauli Ravfogel, Thomas Hofmann, Tiago Pimentel, Imanol Schlag

    Abstract: Multilinguality is crucial for extending recent advancements in language modelling to diverse linguistic communities. To maintain high performance while representing multiple languages, multilingual models ideally align representations, allowing what is learned in one language to generalise to others. Prior research has emphasised the importance of parallel data and shared vocabulary elements as k…

    Submitted 13 May, 2024; v1 submitted 11 April, 2024; originally announced April 2024.

    ACM Class: I.2.7

  3. arXiv:2404.06508  [pdf, other]

    cs.CL cs.LG

    On the Effect of (Near) Duplicate Subwords in Language Modelling

    Authors: Anton Schäfer, Thomas Hofmann, Imanol Schlag, Tiago Pimentel

    Abstract: Tokenisation is a core part of language models (LMs). It involves splitting a character sequence into subwords which are assigned arbitrary indices before being served to the LM. While typically lossless, however, this process may lead to less sample efficient LM training: as it removes character-level information, it could make it harder for LMs to generalise across similar subwords, such as now…

    Submitted 2 May, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

    ACM Class: I.2.7

  4. arXiv:2311.03233  [pdf, other]

    cs.LG cs.CV

    Navigating Scaling Laws: Compute Optimality in Adaptive Model Training

    Authors: Sotiris Anagnostidis, Gregor Bachmann, Imanol Schlag, Thomas Hofmann

    Abstract: In recent years, the state-of-the-art in deep learning has been dominated by very large models that have been pre-trained on vast amounts of data. The paradigm is very simple: investing more computational resources (optimally) leads to better performance, and even predictably so; neural scaling laws have been derived that accurately forecast the performance of a network for a desired level of comp…

    Submitted 23 May, 2024; v1 submitted 6 November, 2023; originally announced November 2023.

  5. arXiv:2309.11197  [pdf, other]

    cs.LG cs.CL

    The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute

    Authors: Aleksandar Stanić, Dylan Ashley, Oleg Serikov, Louis Kirsch, Francesco Faccio, Jürgen Schmidhuber, Thomas Hofmann, Imanol Schlag

    Abstract: The Languini Kitchen serves as both a research collective and codebase designed to empower researchers with limited computational resources to contribute meaningfully to the field of language modelling. We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours. The number of tokens on which a model is trained is defined by the m…

    Submitted 20 September, 2023; originally announced September 2023.

  6. arXiv:2305.17066  [pdf, other]

    cs.AI cs.CL cs.CV cs.LG cs.MA

    Mindstorms in Natural Language-Based Societies of Mind

    Authors: Mingchen Zhuge, Haozhe Liu, Francesco Faccio, Dylan R. Ashley, Róbert Csordás, Anand Gopalakrishnan, Abdullah Hamdi, Hasan Abed Al Kader Hammoud, Vincent Herrmann, Kazuki Irie, Louis Kirsch, Bing Li, Guohao Li, Shuming Liu, Jinjie Mai, Piotr Piękos, Aditya Ramesh, Imanol Schlag, Weimin Shi, Aleksandar Stanić, Wenyi Wang, Yuhui Wang, Mengmeng Xu, Deng-Ping Fan, Bernard Ghanem, et al. (1 additional author not shown)

    Abstract: Both Minsky's "society of mind" and Schmidhuber's "learning to think" inspire diverse societies of large multimodal neural networks (NNs) that solve problems by interviewing each other in a "mindstorm." Recent implementations of NN-based societies of minds consist of large language models (LLMs) and other NN-based experts communicating through a natural language interface. In doing so, they overco…

    Submitted 26 May, 2023; originally announced May 2023.

    Comments: 9 pages in main text + 7 pages of references + 38 pages of appendices, 14 figures in main text + 13 in appendices, 7 tables in appendices

    MSC Class: 68T07 ACM Class: I.2.6; I.2.11

  7. arXiv:2305.05364  [pdf, other]

    cs.LG cs.AI cs.CL

    Large Language Model Programs

    Authors: Imanol Schlag, Sainbayar Sukhbaatar, Asli Celikyilmaz, Wen-tau Yih, Jason Weston, Jürgen Schmidhuber, Xian Li

    Abstract: In recent years, large pre-trained language models (LLMs) have demonstrated the ability to follow instructions and perform novel tasks from a few examples. The possibility to parameterise an LLM through such in-context examples widens their capability at a much lower cost than finetuning. We extend this line of reasoning and present a method which further expands the capabilities of an LLM by embe…

    Submitted 9 May, 2023; originally announced May 2023.

  8. arXiv:2206.14858  [pdf, other]

    cs.CL cs.AI cs.LG

    Solving Quantitative Reasoning Problems with Language Models

    Authors: Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, Vedant Misra

    Abstract: Language models have achieved remarkable performance on a wide range of tasks that require natural language understanding. Nevertheless, state-of-the-art models have generally struggled with tasks that require quantitative reasoning, such as solving mathematics, science, and engineering problems at the college level. To help close this gap, we introduce Minerva, a large language model pretrained o…

    Submitted 30 June, 2022; v1 submitted 29 June, 2022; originally announced June 2022.

    Comments: 12 pages, 5 figures + references and appendices

  9. arXiv:2203.07852  [pdf, other]

    cs.LG cs.AI cs.NE

    Block-Recurrent Transformers

    Authors: DeLesley Hutchins, Imanol Schlag, Yuhuai Wu, Ethan Dyer, Behnam Neyshabur

    Abstract: We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashion along a sequence, and has linear complexity with respect to sequence length. Our recurrent cell operates on blocks of tokens rather than single tokens during training, and leverages parallel computation within a block in order to make efficient use of accelerator hardware. The cell itself is stri…

    Submitted 1 November, 2022; v1 submitted 11 March, 2022; originally announced March 2022.

    Comments: Update to NeurIPS camera-ready version

  10. arXiv:2202.05780  [pdf, other]

    cs.LG

    A Modern Self-Referential Weight Matrix That Learns to Modify Itself

    Authors: Kazuki Irie, Imanol Schlag, Róbert Csordás, Jürgen Schmidhuber

    Abstract: The weight matrix (WM) of a neural network (NN) is its program. The programs of many traditional NNs are learned through gradient descent in some error function, then remain fixed. The WM of a self-referential NN, however, can keep rapidly modifying all of itself during runtime. In principle, such NNs can meta-learn to learn, and meta-meta-learn to meta-learn to learn, and so on, in the sense of r…

    Submitted 17 June, 2022; v1 submitted 11 February, 2022; originally announced February 2022.

    Comments: Accepted to ICML 2022

  11. arXiv:2112.15550  [pdf, other]

    cs.LG cs.CV

    Improving Baselines in the Wild

    Authors: Kazuki Irie, Imanol Schlag, Róbert Csordás, Jürgen Schmidhuber

    Abstract: We share our experience with the recently released WILDS benchmark, a collection of ten datasets dedicated to developing models and training strategies which are robust to domain shifts. Several experiments yield a couple of critical observations which we believe are of general interest for any future work on WILDS. Our study focuses on two datasets: iWildCam and FMoW. We show that (1) Conducting…

    Submitted 31 December, 2021; originally announced December 2021.

    Comments: Presented at NeurIPS 2021 Workshop on Distribution Shifts, https://openreview.net/forum?id=9vxOrkNTs1x

  12. arXiv:2106.06295  [pdf, other]

    cs.LG

    Going Beyond Linear Transformers with Recurrent Fast Weight Programmers

    Authors: Kazuki Irie, Imanol Schlag, Róbert Csordás, Jürgen Schmidhuber

    Abstract: Transformers with linearised attention ("linear Transformers") have demonstrated the practical scalability and effectiveness of outer product-based Fast Weight Programmers (FWPs) from the '90s. However, the original FWP formulation is more general than the one of linear Transformers: a slow neural network (NN) continually reprograms the weights of a fast NN with arbitrary architecture. In existi…

    Submitted 26 October, 2021; v1 submitted 11 June, 2021; originally announced June 2021.

    Comments: Accepted to NeurIPS 2021

  13. arXiv:2102.11174  [pdf, other]

    cs.LG

    Linear Transformers Are Secretly Fast Weight Programmers

    Authors: Imanol Schlag, Kazuki Irie, Jürgen Schmidhuber

    Abstract: We show the formal equivalence of linearised self-attention mechanisms and fast weight controllers from the early '90s, where a "slow" neural net learns by gradient descent to program the "fast weights" of another net through sequences of elementary programming instructions which are additive outer products of self-invented activation patterns (today called keys and values). Such Fast Weight Pro…

    Submitted 9 June, 2021; v1 submitted 22 February, 2021; originally announced February 2021.

  14. arXiv:2011.07831  [pdf, other]

    cs.LG cs.NE

    Learning Associative Inference Using Fast Weight Memory

    Authors: Imanol Schlag, Tsendsuren Munkhdalai, Jürgen Schmidhuber

    Abstract: Humans can quickly associate stimuli to solve problems in novel contexts. Our novel neural network model learns state representations of facts that can be composed to perform such associative inference. To this end, we augment the LSTM model with an associative memory, dubbed Fast Weight Memory (FWM). Through differentiable operations at every step of a given input sequence, the LSTM updates and m…

    Submitted 23 February, 2021; v1 submitted 16 November, 2020; originally announced November 2020.

  15. arXiv:1910.06611  [pdf, other]

    cs.LG stat.ML

    Enhancing the Transformer with Explicit Relational Encoding for Math Problem Solving

    Authors: Imanol Schlag, Paul Smolensky, Roland Fernandez, Nebojsa Jojic, Jürgen Schmidhuber, Jianfeng Gao

    Abstract: We incorporate Tensor-Product Representations within the Transformer in order to better support the explicit representation of relation structure. Our Tensor-Product Transformer (TP-Transformer) sets a new state of the art on the recently-introduced Mathematics Dataset containing 56 categories of free-form math word-problems. The essential component of the model is a novel attention mechanism, cal…

    Submitted 4 November, 2020; v1 submitted 15 October, 2019; originally announced October 2019.

  16. arXiv:1811.12143  [pdf, other]

    cs.LG cs.NE stat.ML

    Learning to Reason with Third-Order Tensor Products

    Authors: Imanol Schlag, Jürgen Schmidhuber

    Abstract: We combine Recurrent Neural Networks with Tensor Product Representations to learn combinatorial representations of sequential data. This improves symbolic interpretation and systematic generalisation. Our architecture is trained end-to-end through gradient descent on a variety of simple natural language reasoning tasks, significantly outperforming the latest state-of-the-art models in single-task…

    Submitted 8 January, 2019; v1 submitted 29 November, 2018; originally announced November 2018.