Search | arXiv e-print repository

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

Authors: Lucius Bushnaq, Stefan Heimersheim, Nicholas Goldowsky-Dill, Dan Braun, Jake Mendel, Kaarel Hänni, Avery Griffin, Jörn Stöhler, Magdalena Wache, Marius Hobbhahn

Abstract: Mechanistic interpretability aims to understand the behavior of neural networks by reverse-engineering their internal computations. However, current methods struggle to find clear interpretations of neural network activations because a decomposition of activations into computational features is missing. Individual neurons or model components do not cleanly correspond to distinct features or functions. We present a novel interpretability method that aims to overcome this limitation by transforming the activations of the network into a new basis - the Local Interaction Basis (LIB). LIB aims to identify computational features by removing irrelevant activations and interactions. Our method drops irrelevant activation directions and aligns the basis with the singular vectors of the Jacobian matrix between adjacent layers. It also scales features based on their importance for downstream computation, producing an interaction graph that shows all computationally-relevant features and interactions in a model. We evaluate the effectiveness of LIB on modular addition and CIFAR-10 models, finding that it identifies more computationally-relevant features that interact more sparsely, compared to principal component analysis. However, LIB does not yield substantial improvements in interpretability or interaction sparsity when applied to language models. We conclude that LIB is a promising theory-driven approach for analyzing neural networks, but in its current form is not applicable to large language models.

Submitted 20 May, 2024; v1 submitted 17 May, 2024; originally announced May 2024.

arXiv:2405.10927

Using Degeneracy in the Loss Landscape for Mechanistic Interpretability

Authors: Lucius Bushnaq, Jake Mendel, Stefan Heimersheim, Dan Braun, Nicholas Goldowsky-Dill, Kaarel Hänni, Cindy Wu, Marius Hobbhahn

Abstract: Mechanistic Interpretability aims to reverse engineer the algorithms implemented by neural networks by studying their weights and activations. An obstacle to reverse engineering neural networks is that many of the parameters inside a network are not involved in the computation being implemented by the network. These degenerate parameters may obfuscate internal structure. Singular learning theory teaches us that neural network parameterizations are biased towards being more degenerate, and parameterizations with more degeneracy are likely to generalize further. We identify 3 ways that network parameters can be degenerate: linear dependence between activations in a layer; linear dependence between gradients passed back to a layer; ReLUs which fire on the same subset of datapoints. We also present a heuristic argument that modular networks are likely to be more degenerate, and we develop a metric for identifying modules in a network that is based on this argument. We propose that if we can represent a neural network in a way that is invariant to reparameterizations that exploit the degeneracies, then this representation is likely to be more interpretable, and we provide some evidence that such a representation is likely to have sparser interactions. We introduce the Interaction Basis, a tractable technique to obtain a representation that is invariant to degeneracies from linear dependence of activations or Jacobians.

Submitted 20 May, 2024; v1 submitted 17 May, 2024; originally announced May 2024.

arXiv:2011.14360

Asymptotics of descent functions

Authors: Kaarel Hänni

Abstract: In 1916, MacMahon showed that permutations in $S_n$ with a fixed descent set $I$ are enumerated by a polynomial $d_I(n)$. Diaz-Lopez, Harris, Insko, Omar, and Sagan recently revived interest in this descent polynomial, and suggested the direction of studying such enumerative questions for other consecutive patterns (descents being the consecutive pattern $21$). Zhu studied this question for the consecutive pattern $321$. We continue this line of work by studying the case of any consecutive pattern of the form $k,k-1,\ldots,1$, which we call a $k$-descent. In this paper, we reduce the problem of determining the asymptotic number of permutations with a certain $k$-descent set to computing an explicit integral. We also prove an equidistribution theorem, showing that any two sparse $k$-descent sets are equally likely. Counting the number of $k$-descent-avoiding permutations while conditioning on the length $n$ and first element $m$ simultaneously, one obtains a number triangle $f_k(m,n)$ with some useful properties. For $k=3$, the $m=1$ and $m=n$ diagonals are OEIS sequences A049774 and A080635. We prove a $k$th difference recurrence relation for entries of this number triangle. This also leads to an $O(n^2)$ algorithm for computing $k$-descent functions. Along the way to these results, we prove an explicit formula for the distribution of first elements of $k$-descent-avoiding permutations, as well as for the joint distribution of first and last elements. We also develop an understanding of discrete order statistics.
In our approach, we combine algebraic, analytic, and probabilistic tools. A number of open problems are stated at the end.

Submitted 29 November, 2020; originally announced November 2020.

Comments: 40 pages, 5 figures
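
Illustrative sketch (not taken from the paper): MacMahon's statement that permutations with a fixed descent set are counted by a polynomial in $n$ can be checked by brute force for small cases. The function names below are hypothetical, and the closed forms $n-1$ for $I=\{1\}$ and $\binom{n-1}{2}$ for $I=\{1,2\}$ are elementary special cases worked out by hand, assuming descent positions are 1-indexed.

```python
from itertools import permutations
from math import comb


def descent_set(perm):
    """Return the 1-indexed positions i with perm[i] > perm[i+1]."""
    return frozenset(
        i + 1 for i in range(len(perm) - 1) if perm[i] > perm[i + 1]
    )


def count_with_descent_set(n, descents):
    """Brute-force count of permutations in S_n whose descent set is exactly `descents`."""
    target = frozenset(descents)
    return sum(
        1 for p in permutations(range(1, n + 1)) if descent_set(p) == target
    )


if __name__ == "__main__":
    # Compare the brute-force counts with the hand-derived polynomials.
    for n in range(3, 8):
        print(
            n,
            count_with_descent_set(n, {1}), n - 1,
            count_with_descent_set(n, {1, 2}), comb(n - 1, 2),
        )
```

The enumeration is factorial in $n$, so it only serves as a sanity check for small $n$; it is not the $O(n^2)$ algorithm mentioned in the abstract.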

arXiv:2009.02138

Counting Signed Vexillary Permutations

Authors: Yibo Gao, Kaarel Hänni

Abstract: We show that the number of signed permutations avoiding 1234 equals the number of signed permutations avoiding 2143 (also called vexillary signed permutations), resolving a conjecture by Anderson and Fulton. The main tool that we use is the generating tree developed by West. Many further directions are mentioned in the end.

Submitted 4 September, 2020; originally announced September 2020.

Comments: 14 pages, 6 figures. To appear in Advances in Applied Mathematics

arXiv:2007.08490

Boolean elements in the Bruhat order

Authors: Yibo Gao, Kaarel Hänni

Abstract: We show that $w\in W$ is boolean if and only if it avoids a set of Billey-Postnikov patterns, which we describe explicitly. Our proof is based on an analysis of inversion sets, and it is in large part type-uniform. We also introduce the notion of linear pattern avoidance, and show that boolean elements are characterized by avoiding just the $3$ linear patterns $s_1 s_2 s_1 \in W(A_2)$, $s_2 s_1 s_3 s_2 \in W(A_3)$, and $s_2 s_1 s_3 s_4 s_2 \in W(D_4)$. We also consider the more general case of $k$-boolean Weyl group elements. We say that $w\in W$ is $k$-boolean if every reduced expression for $w$ contains at most $k$ copies of each generator. We show that the $2$-boolean elements of the symmetric group $S_n$ are characterized by avoiding the patterns $3421,4312,4321,$ and $456123$, and give a rational generating function for the number of $2$-boolean elements of $S_n$.

Submitted 16 July, 2020; originally announced July 2020.

Comments: 24 pages, 3 figures

MSC Class: 05E16
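
Illustrative sketch (not taken from the paper): the pattern-avoidance side of the $2$-boolean characterization can be explored by brute force. The code below counts permutations of $S_n$ that avoid $3421$, $4312$, $4321$, and $456123$ in the classical sense (some subsequence is order-isomorphic to the pattern). The function names are hypothetical, and the code does not check the reduced-expression definition of $2$-boolean; it only enumerates avoiders.

```python
from itertools import combinations, permutations

# Patterns quoted in the abstract, written in one-line notation.
PATTERNS = [(3, 4, 2, 1), (4, 3, 1, 2), (4, 3, 2, 1), (4, 5, 6, 1, 2, 3)]


def order_type(seq):
    """Indices of seq sorted by value; equal order types mean order-isomorphic."""
    return tuple(sorted(range(len(seq)), key=lambda i: seq[i]))


def contains(perm, pattern):
    """True if some subsequence of perm is order-isomorphic to pattern."""
    target = order_type(pattern)
    for idx in combinations(range(len(perm)), len(pattern)):
        if order_type([perm[i] for i in idx]) == target:
            return True
    return False


def count_avoiders(n, patterns=PATTERNS):
    """Brute-force count of permutations in S_n avoiding every given pattern."""
    return sum(
        1
        for p in permutations(range(1, n + 1))
        if not any(contains(p, q) for q in patterns)
    )


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, count_avoiders(n))
```

For $n \le 3$ every permutation is an avoider because all four patterns have length at least 4; for $n = 4$ exactly the three length-4 patterns themselves are excluded.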

arXiv:2001.01149

The probability of selecting $k$ edge-disjoint Hamilton cycles in the complete graph

Authors: Asaf Ferber, Kaarel Haenni, Vishesh Jain

Abstract: Let $H_1,\dots,H_k$ be Hamilton cycles in $K_n$, chosen independently and uniformly at random. We show, for $k = o(n^{1/100})$, that the probability of $H_1,\dots,H_k$ being edge-disjoint is $(1+o(1))e^{-2\binom{k}{2}}$. This extends a corresponding estimate obtained by Robbins in the case $k=2$.

Submitted 4 January, 2020; originally announced January 2020.

Comments: 8 pages
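
Illustrative sketch (not taken from the paper): the estimate $(1+o(1))e^{-2\binom{k}{2}}$ can be compared with a Monte Carlo simulation for small $k$. The function names, the choice of $n$, the trial count, and the seed below are all assumptions made for illustration. A uniform random Hamilton cycle is sampled by shuffling the vertices and joining consecutive ones cyclically, which is uniform because every cycle arises from the same number of orderings.

```python
import math
import random


def random_hamilton_cycle_edges(n, rng):
    """Edge set of a uniformly random Hamilton cycle in K_n (n >= 3)."""
    order = list(range(n))
    rng.shuffle(order)
    return {frozenset((order[i], order[(i + 1) % n])) for i in range(n)}


def estimate_disjoint_probability(n, k, trials, seed=0):
    """Monte Carlo estimate that k independent random Hamilton cycles are edge-disjoint."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        seen = set()
        disjoint = True
        for _ in range(k):
            edges = random_hamilton_cycle_edges(n, rng)
            if seen & edges:  # a shared edge: this trial fails
                disjoint = False
                break
            seen |= edges
        hits += disjoint
    return hits / trials


def asymptotic_prediction(k):
    """Leading-order value e^{-2 * C(k, 2)} quoted in the abstract."""
    return math.exp(-2 * math.comb(k, 2))


if __name__ == "__main__":
    for k in (2, 3):
        print(k, estimate_disjoint_probability(60, k, 20000), asymptotic_prediction(k))
```

For finite $n$ the simulated value differs from the leading-order prediction by the $o(1)$ correction, so only rough agreement should be expected.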

Showing 1–6 of 6 results for author: Hänni, K