Search | arXiv e-print repository

A phase transition between positional and semantic learning in a solvable model of dot-product attention

Authors: Hugo Cui, Freya Behrens, Florent Krzakala, Lenka Zdeborová

Abstract: We investigate how a dot-product attention layer learns a positional attention matrix (with tokens attending to each other based on their respective positions) and a semantic attention matrix (with tokens attending to each other based on their meaning). For an algorithmic task, we experimentally show how the same simple architecture can learn to implement a solution using either the positional or… ▽ More We investigate how a dot-product attention layer learns a positional attention matrix (with tokens attending to each other based on their respective positions) and a semantic attention matrix (with tokens attending to each other based on their meaning). For an algorithmic task, we experimentally show how the same simple architecture can learn to implement a solution using either the positional or semantic mechanism. On the theoretical side, we study the learning of a non-linear self-attention layer with trainable tied and low-rank query and key matrices. In the asymptotic limit of high-dimensional data and a comparably large number of training samples, we provide a closed-form characterization of the global minimum of the non-convex empirical loss landscape. We show that this minimum corresponds to either a positional or a semantic mechanism and evidence an emergent phase transition from the former to the latter with increasing sample complexity. Finally, we compare the dot-product attention layer to linear positional baseline, and show that it outperforms the latter using the semantic mechanism provided it has access to sufficient data. △ Less

Submitted 6 February, 2024; originally announced February 2024.

arXiv:2306.07104 [pdf, other]

Unveiling the Hessian's Connection to the Decision Boundary

Authors: Mahalakshmi Sabanayagam, Freya Behrens, Urte Adomaityte, Anna Dawid

Abstract: Understanding the properties of well-generalizing minima is at the heart of deep learning research. On the one hand, the generalization of neural networks has been connected to the decision boundary complexity, which is hard to study in the high-dimensional input space. Conversely, the flatness of a minimum has become a controversial proxy for generalization. In this work, we provide the missing l… ▽ More Understanding the properties of well-generalizing minima is at the heart of deep learning research. On the one hand, the generalization of neural networks has been connected to the decision boundary complexity, which is hard to study in the high-dimensional input space. Conversely, the flatness of a minimum has become a controversial proxy for generalization. In this work, we provide the missing link between the two approaches and show that the Hessian top eigenvectors characterize the decision boundary learned by the neural network. Notably, the number of outliers in the Hessian spectrum is proportional to the complexity of the decision boundary. Based on this finding, we provide a new and straightforward approach to studying the complexity of a high-dimensional decision boundary; show that this connection naturally inspires a new generalization measure; and finally, we develop a novel margin estimation technique which, in combination with the generalization measure, precisely identifies minima with simple wide-margin boundaries. Overall, this analysis establishes the connection between the Hessian and the decision boundary and provides a new method to identify minima with simple wide-margin decision boundaries. △ Less

Submitted 12 June, 2023; originally announced June 2023.

Comments: 14 pages, 6 figures + 18-page appendices with 19 figures. Any feedback is very welcome! Code is available at https://github.com/Shmoo137/Hessian-and-Decision-Boundary

arXiv:2202.10379 [pdf, other]

doi 10.1088/1751-8121/ac8b46

(Dis)assortative Partitions on Random Regular Graphs

Authors: Freya Behrens, Gabriel Arpino, Yaroslav Kivva, Lenka Zdeborová

Abstract: We study the problem of assortative and disassortative partitions on random $d$-regular graphs. Nodes in the graph are partitioned into two non-empty groups. In the assortative partition every node requires at least $H$ of their neighbors to be in their own group. In the disassortative partition they require less than $H$ neighbors to be in their own group. Using the cavity method based on analysi… ▽ More We study the problem of assortative and disassortative partitions on random $d$-regular graphs. Nodes in the graph are partitioned into two non-empty groups. In the assortative partition every node requires at least $H$ of their neighbors to be in their own group. In the disassortative partition they require less than $H$ neighbors to be in their own group. Using the cavity method based on analysis of the Belief Propagation algorithm we establish for which combinations of parameters $(d,H)$ these partitions exist with high probability and for which they do not. For $H>\lceil \frac{d}{2} \rceil $ we establish that the structure of solutions to the assortative partition problems corresponds to the so-called frozen-1RSB. This entails a conjecture of algorithmic hardness of finding these partitions efficiently. For $H \le \lceil \frac{d}{2} \rceil $ we argue that the assortative partition problem is algorithmically easy on average for all $d$. Further we provide arguments about asymptotic equivalence between the assortative partition problem and the disassortative one, going trough a close relation to the problem of single-spin-flip-stable states in spin glasses. In the context of spin glasses, our results on algorithmic hardness imply a conjecture that gapped single spin flip stable states are hard to find which may be a universal reason behind the observation that physical dynamics in glassy systems display convergence to marginal stability. △ Less

Submitted 2 May, 2022; v1 submitted 21 February, 2022; originally announced February 2022.

Comments: 21 pages; Corrected usage of the world "planted" in Section 4

Journal ref: J. Phys. A: Math. Theor. 55 395004 (2022)

arXiv:2102.03815 [pdf, other]

Bandits for Learning to Explain from Explanations

Authors: Freya Behrens, Stefano Teso, Davide Mottin

Abstract: We introduce Explearn, an online algorithm that learns to jointly output predictions and explanations for those predictions. Explearn leverages Gaussian Processes (GP)-based contextual bandits. This brings two key benefits. First, GPs naturally capture different kinds of explanations and enable the system designer to control how explanations generalize across the space by virtue of choosing a suit… ▽ More We introduce Explearn, an online algorithm that learns to jointly output predictions and explanations for those predictions. Explearn leverages Gaussian Processes (GP)-based contextual bandits. This brings two key benefits. First, GPs naturally capture different kinds of explanations and enable the system designer to control how explanations generalize across the space by virtue of choosing a suitable kernel. Second, Explearn builds on recent results in contextual bandits which guarantee convergence with high probability. Our initial experiments hint at the promise of the approach. △ Less

Submitted 7 February, 2021; originally announced February 2021.

Comments: Accepted at the Explainable Agency in Artificial Intelligence Workshop, hosted at the 35th AAAI Conference on Artificial Intelligence, February 2-9, 2021

arXiv:2010.01930 [pdf, other]

Neurally Augmented ALISTA

Authors: Freya Behrens, Jonathan Sauder, Peter Jung

Abstract: It is well-established that many iterative sparse reconstruction algorithms can be unrolled to yield a learnable neural network for improved empirical performance. A prime example is learned ISTA (LISTA) where weights, step sizes and thresholds are learned from training data. Recently, Analytic LISTA (ALISTA) has been introduced, combining the strong empirical performance of a fully learned approa… ▽ More It is well-established that many iterative sparse reconstruction algorithms can be unrolled to yield a learnable neural network for improved empirical performance. A prime example is learned ISTA (LISTA) where weights, step sizes and thresholds are learned from training data. Recently, Analytic LISTA (ALISTA) has been introduced, combining the strong empirical performance of a fully learned approach like LISTA, while retaining theoretical guarantees of classical compressed sensing algorithms and significantly reducing the number of parameters to learn. However, these parameters are trained to work in expectation, often leading to suboptimal reconstruction of individual targets. In this work we therefore introduce Neurally Augmented ALISTA, in which an LSTM network is used to compute step sizes and thresholds individually for each target vector during reconstruction. This adaptive approach is theoretically motivated by revisiting the recovery guarantees of ALISTA. We show that our approach further improves empirical performance in sparse reconstruction, in particular outperforming existing algorithms by an increasing margin as the compression ratio becomes more challenging. △ Less

Submitted 5 October, 2020; originally announced October 2020.

Comments: 10pages, 9 figures

Showing 1–5 of 5 results for author: Behrens, F