Showing 1–17 of 17 results for author: Saxe, A M

  1. arXiv:2404.07129  [pdf, other]

    cs.LG

    What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation

    Authors: Aaditya K. Singh, Ted Moskovitz, Felix Hill, Stephanie C. Y. Chan, Andrew M. Saxe

    Abstract: In-context learning is a powerful emergent ability in transformer models. Prior work in mechanistic interpretability has identified a circuit element that may be critical for in-context learning -- the induction head (IH), which performs a match-and-copy operation. During training of large transformers on natural language data, IHs emerge around the same time as a notable phase change in the loss.…

    Submitted 10 April, 2024; originally announced April 2024.

    Comments: 26 pages, 18 figures

  2. arXiv:2402.09142  [pdf, other]

    cs.LG q-bio.NC

    When Representations Align: Universality in Representation Learning Dynamics

    Authors: Loek van Rossem, Andrew M. Saxe

    Abstract: Deep neural networks come in many sizes and architectures. The choice of architecture, in conjunction with the dataset and learning algorithm, is commonly understood to affect the learned neural representations. Yet, recent results have shown that different architectures learn representations with striking qualitative similarities. Here we derive an effective theory of representation learning unde…

    Submitted 14 February, 2024; originally announced February 2024.

    Comments: 22 pages, 16 figures

  3. arXiv:2311.08360  [pdf, other]

    cs.LG cs.AI cs.CL

    The Transient Nature of Emergent In-Context Learning in Transformers

    Authors: Aaditya K. Singh, Stephanie C. Y. Chan, Ted Moskovitz, Erin Grant, Andrew M. Saxe, Felix Hill

    Abstract: Transformer neural networks can exhibit a surprising capacity for in-context learning (ICL) despite not being explicitly trained for it. Prior work has provided a deeper understanding of how ICL emerges in transformers, e.g. through the lens of mechanistic interpretability, Bayesian inference, or by examining the distributional properties of training data. However, in each of these cases, ICL is t…

    Submitted 11 December, 2023; v1 submitted 14 November, 2023; originally announced November 2023.

    Comments: 19 pages, 16 figures

  4. arXiv:2310.19919  [pdf, other]

    cs.NE cs.LG q-bio.NC

    Meta-Learning Strategies through Value Maximization in Neural Networks

    Authors: Rodrigo Carrasco-Davis, Javier Masís, Andrew M. Saxe

    Abstract: Biological and artificial learning agents face numerous choices about how to learn, ranging from hyperparameter selection to aspects of task distributions like curricula. Understanding how to make these meta-learning choices could offer normative accounts of cognitive control functions in biological learners and improve engineered systems. Yet optimal strategies remain challenging to compute in mo…

    Submitted 30 October, 2023; originally announced October 2023.

    Comments: Under Review

  5. arXiv:2302.11351  [pdf, other]

    cs.AI q-bio.NC

    Abrupt and spontaneous strategy switches emerge in simple regularised neural networks

    Authors: Anika T. Löwe, Léo Touzo, Paul S. Muhle-Karbe, Andrew M. Saxe, Christopher Summerfield, Nicolas W. Schuck

    Abstract: Humans sometimes have an insight that leads to a sudden and drastic performance improvement on the task they are working on. Sudden strategy adaptations are often linked to insights, considered to be a unique aspect of human cognition tied to complex processes such as creativity or meta-cognitive reasoning. Here, we take a learning perspective and ask whether insight-like behaviour can occur in si…

    Submitted 1 March, 2024; v1 submitted 22 February, 2023; originally announced February 2023.

    Comments: 17 pages, 5 figures

  6. arXiv:2207.10430  [pdf, other]

    cs.LG cs.AI

    The Neural Race Reduction: Dynamics of Abstraction in Gated Networks

    Authors: Andrew M. Saxe, Shagun Sodhani, Sam Lewallen

    Abstract: Our theoretical understanding of deep learning has not kept pace with its empirical success. While network architecture is known to be critical, we do not yet understand its effect on learned representations and network behavior, or how this architecture should reflect task structure. In this work, we begin to address this gap by introducing the Gated Deep Linear Network framework that schematizes…

    Submitted 21 July, 2022; originally announced July 2022.

    Comments: ICML 2022; 23 pages; 10 figures

  7. arXiv:1906.08632  [pdf, other]

    stat.ML cond-mat.dis-nn cond-mat.stat-mech cs.LG

    Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup

    Authors: Sebastian Goldt, Madhu S. Advani, Andrew M. Saxe, Florent Krzakala, Lenka Zdeborová

    Abstract: Deep neural networks achieve stellar generalisation even when they have enough parameters to easily fit all their training data. We study this phenomenon by analysing the dynamics and the performance of over-parameterised two-layer neural networks in the teacher-student setup, where one network, the student, is trained on data generated by another network, called the teacher. We show how the dynam…

    Submitted 27 October, 2019; v1 submitted 18 June, 2019; originally announced June 2019.

    Comments: 9 pages + references + supplemental material. Oral presentation at NeurIPS 2019. arXiv admin note: substantial text overlap with arXiv:1901.09085

    Journal ref: J. Stat. Mech. 2020 124010 & NeurIPS 2019

  8. arXiv:1901.09085  [pdf, other]

    stat.ML cond-mat.dis-nn cond-mat.stat-mech cs.LG

    Generalisation dynamics of online learning in over-parameterised neural networks

    Authors: Sebastian Goldt, Madhu S. Advani, Andrew M. Saxe, Florent Krzakala, Lenka Zdeborová

    Abstract: Deep neural networks achieve stellar generalisation on a variety of problems, despite often being large enough to easily fit all their training data. Here we study the generalisation dynamics of two-layer neural networks in a teacher-student setup, where one network, the student, is trained using stochastic gradient descent (SGD) on data generated by another network, called the teacher. We show ho…

    Submitted 25 January, 2019; originally announced January 2019.

    Comments: 25 pages, 13 figures

    Journal ref: Presented at the ICML 2019 Workshop on Theoretical Physics for Deep Learning

  9. arXiv:1810.10531  [pdf, other]

    cs.LG cs.AI q-bio.NC stat.ML

    A mathematical theory of semantic development in deep neural networks

    Authors: Andrew M. Saxe, James L. McClelland, Surya Ganguli

    Abstract: An extensive body of empirical research has revealed remarkable regularities in the acquisition, organization, deployment, and neural representation of human semantic knowledge, thereby raising a fundamental conceptual question: what are the theoretical principles governing the ability of neural networks to acquire, organize, and deploy abstract knowledge by integrating across many individual expe…

    Submitted 23 October, 2018; originally announced October 2018.

  10. arXiv:1806.00730  [pdf, other]

    stat.ML cs.LG cs.NE

    Minnorm training: an algorithm for training over-parameterized deep neural networks

    Authors: Yamini Bansal, Madhu Advani, David D Cox, Andrew M Saxe

    Abstract: In this work, we propose a new training method for finding minimum weight norm solutions in over-parameterized neural networks (NNs). This method seeks to improve training speed and generalization performance by framing NN training as a constrained optimization problem wherein the sum of the norm of the weights in each layer of the network is minimized, under the constraint of exactly fitting trai…

    Submitted 21 June, 2018; v1 submitted 2 June, 2018; originally announced June 2018.

  11. arXiv:1803.01927  [pdf, other]

    cs.LG cond-mat.stat-mech stat.ML

    Energy-entropy competition and the effectiveness of stochastic gradient descent in machine learning

    Authors: Yao Zhang, Andrew M. Saxe, Madhu S. Advani, Alpha A. Lee

    Abstract: Finding parameters that minimise a loss function is at the core of many machine learning methods. The Stochastic Gradient Descent algorithm is widely used and delivers state of the art results for many problems. Nonetheless, Stochastic Gradient Descent typically cannot find the global minimum, thus its empirical effectiveness is hitherto mysterious. We derive a correspondence between parameter inf…

    Submitted 5 March, 2018; originally announced March 2018.

  12. arXiv:1710.03667  [pdf, other]

    stat.ML cs.LG physics.data-an q-bio.NC

    High-dimensional dynamics of generalization error in neural networks

    Authors: Madhu S. Advani, Andrew M. Saxe

    Abstract: We perform an average case analysis of the generalization dynamics of large neural networks trained using gradient descent. We study the practically-relevant "high-dimensional" regime where the number of free parameters in the network is on the order of or even larger than the number of examples in the dataset. Using random matrix theory and exact solutions in linear models, we derive the generali…

    Submitted 10 October, 2017; originally announced October 2017.

  13. arXiv:1708.00463  [pdf, other]

    cs.AI

    Hierarchical Subtask Discovery With Non-Negative Matrix Factorization

    Authors: Adam C. Earle, Andrew M. Saxe, Benjamin Rosman

    Abstract: Hierarchical reinforcement learning methods offer a powerful means of planning flexible behavior in complicated domains. However, learning an appropriate hierarchical decomposition of a domain into subtasks remains a substantial challenge. We present a novel algorithm for subtask discovery, based on the recently introduced multitask linearly-solvable Markov decision process (MLMDP) framework. The…

    Submitted 1 August, 2017; originally announced August 2017.

    Comments: 7 pages, Accepted at Lifelong Learning: A Reinforcement Learning Approach Workshop, ICML, Sydney, Australia, 2017

  14. arXiv:1612.02757  [pdf, other]

    cs.AI

    Hierarchy through Composition with Linearly Solvable Markov Decision Processes

    Authors: Andrew M. Saxe, Adam Earle, Benjamin Rosman

    Abstract: Hierarchical architectures are critical to the scalability of reinforcement learning methods. Current hierarchical frameworks execute actions serially, with macro-actions comprising sequences of primitive actions. We propose a novel alternative to these control hierarchies based on concurrent execution of many actions in parallel. Our scheme uses the concurrent compositionality provided by the lin…

    Submitted 8 December, 2016; originally announced December 2016.

    Comments: 9 pages, 3 figures

  15. arXiv:1606.02355  [pdf, other]

    cs.LG cs.AI stat.ML

    Active Long Term Memory Networks

    Authors: Tommaso Furlanello, Jiaping Zhao, Andrew M. Saxe, Laurent Itti, Bosco S. Tjan

    Abstract: Continual Learning in artificial neural networks suffers from interference and forgetting when different tasks are learned sequentially. This paper introduces the Active Long Term Memory Networks (A-LTM), a model of sequential multi-task deep learning that is able to maintain previously learned association between sensory input and behavioral output while acquiring new knowledge. A-LTM exploits t…

    Submitted 7 June, 2016; originally announced June 2016.

  16. arXiv:1412.6544  [pdf, other]

    cs.NE cs.LG stat.ML

    Qualitatively characterizing neural network optimization problems

    Authors: Ian J. Goodfellow, Oriol Vinyals, Andrew M. Saxe

    Abstract: Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct t…

    Submitted 21 May, 2015; v1 submitted 19 December, 2014; originally announced December 2014.

  17. arXiv:1312.6120  [pdf, other]

    cs.NE cond-mat.dis-nn cs.CV cs.LG q-bio.NC stat.ML

    Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

    Authors: Andrew M. Saxe, James L. McClelland, Surya Ganguli

    Abstract: Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. We attempt to bridge the gap between the theory and practice of deep learning by systematically analyzing learning dynamics for the restricted case of deep linear neural networks. Despite the linearity of their input-output map,…

    Submitted 19 February, 2014; v1 submitted 20 December, 2013; originally announced December 2013.

    Comments: Submission to ICLR2014. Revised based on reviewer feedback