-
Towards an Improved Understanding and Utilization of Maximum Manifold Capacity Representations
Authors:
Rylan Schaeffer,
Victor Lecomte,
Dhruv Bhandarkar Pai,
Andres Carranza,
Berivan Isik,
Alyssa Unell,
Mikail Khona,
Thomas Yerxa,
Yann LeCun,
SueYeon Chung,
Andrey Gromov,
Ravid Shwartz-Ziv,
Sanmi Koyejo
Abstract:
Maximum Manifold Capacity Representations (MMCR) is a recent multi-view self-supervised learning (MVSSL) method that matches or surpasses other leading MVSSL methods. MMCR is intriguing because it does not fit neatly into any of the commonplace MVSSL lineages, instead originating from a statistical mechanical perspective on the linear separability of data manifolds. In this paper, we seek to improve our understanding and our utilization of MMCR. To better understand MMCR, we leverage tools from high-dimensional probability to demonstrate that MMCR incentivizes alignment and uniformity of learned embeddings. We then leverage tools from information theory to show that such embeddings maximize a well-known lower bound on mutual information between views, thereby connecting the geometric perspective of MMCR to the information-theoretic perspective commonly discussed in MVSSL. To better utilize MMCR, we mathematically predict and experimentally confirm non-monotonic changes in the pretraining loss akin to double descent but with respect to atypical hyperparameters. We also discover compute scaling laws that enable predicting the pretraining loss as a function of gradient steps, batch size, embedding dimension, and number of views. We then show that MMCR, originally applied to image data, is performant on multimodal image-text data. By more deeply understanding the theoretical and empirical behavior of MMCR, our work reveals insights on improving MVSSL methods.
Submitted 13 June, 2024;
originally announced June 2024.
-
What Causes Polysemanticity? An Alternative Origin Story of Mixed Selectivity from Incidental Causes
Authors:
Victor Lecomte,
Kushal Thaman,
Rylan Schaeffer,
Naomi Bashkansky,
Trevor Chow,
Sanmi Koyejo
Abstract:
Polysemantic neurons -- neurons that activate for a set of unrelated features -- have been seen as a significant obstacle towards interpretability of task-optimized deep networks, with implications for AI safety. The classic origin story of polysemanticity is that the data contains more ``features'' than neurons, such that learning to perform a task forces the network to co-allocate multiple unrelated features to the same neuron, endangering our ability to understand networks' internal processing. In this work, we present a second and non-mutually exclusive origin story of polysemanticity. We show that polysemanticity can arise incidentally, even when there are ample neurons to represent all features in the data, a phenomenon we term \textit{incidental polysemanticity}. Using a combination of theory and experiments, we show that incidental polysemanticity can arise due to multiple reasons including regularization and neural noise; this incidental polysemanticity occurs because random initialization can, by chance alone, initially assign multiple features to the same neuron, and the training dynamics then strengthen such overlap. Our paper concludes by calling for further research quantifying the performance-polysemanticity tradeoff in task-optimized deep neural networks to better understand to what extent polysemanticity is avoidable.
Submitted 13 February, 2024; v1 submitted 5 December, 2023;
originally announced December 2023.
-
The composition complexity of majority
Authors:
Victor Lecomte,
Prasanna Ramakrishnan,
Li-Yang Tan
Abstract:
We study the complexity of computing majority as a composition of local functions: \[ \text{Maj}_n = h(g_1,\ldots,g_m), \] where each $g_j :\{0,1\}^{n} \to \{0,1\}$ is an arbitrary function that queries only $k \ll n$ variables and $h : \{0,1\}^m \to \{0,1\}$ is an arbitrary combining function. We prove an optimal lower bound of \[ m \ge Ω\left( \frac{n}{k} \log k \right) \] on the number of functions needed, which is a factor $Ω(\log k)$ larger than the ideal $m = n/k$. We call this factor the composition overhead; previously, no superconstant lower bounds on it were known for majority.
Our lower bound recovers, as a corollary and via an entirely different proof, the best known lower bound for bounded-width branching programs for majority (Alon and Maass '86, Babai et al. '90). It is also the first step in a plan that we propose for breaking a longstanding barrier in lower bounds for small-depth boolean circuits.
Novel aspects of our proof include sharp bounds on the information lost as computation flows through the inner functions $g_j$, and the bootstrapping of lower bounds for a multi-output function (Hamming weight) into lower bounds for a single-output one (majority).
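As an illustration of why a $\log k$ overhead is the right order of magnitude, here is a minimal Python sketch of one natural matching construction. It is not taken from the paper, and the names `make_inner_functions`, `combine`, and `composed_majority` are hypothetical. The input is split into blocks of at most $k$ variables, each inner function outputs one bit of a block's Hamming weight in binary, and the combining function adds the block weights and compares the total to $n/2$. The sketch assumes the strict convention $\text{Maj}_n(x) = 1$ iff more than $n/2$ inputs are 1.

```python
def make_inner_functions(n, k):
    """Build inner functions g_j, each reading a single block of at most k inputs.

    Every g_j outputs one bit of the binary expansion of its block's Hamming
    weight, so m = ceil(n / k) * bits_per_block = O((n / k) * log k).
    """
    bits_per_block = k.bit_length()  # enough bits to write any weight in 0..k
    inner = []
    for start in range(0, n, k):
        block = range(start, min(start + k, n))
        for b in range(bits_per_block):
            # Default arguments freeze the current block and bit position.
            inner.append(
                lambda x, block=block, b=b: (sum(x[i] for i in block) >> b) & 1
            )
    return inner, bits_per_block


def combine(outputs, bits_per_block, n):
    """Combining function h: rebuild each block weight, add them, compare to n/2."""
    total = 0
    for j, bit in enumerate(outputs):
        total += bit << (j % bits_per_block)
    return int(2 * total > n)


def composed_majority(x, k):
    """Evaluate h(g_1(x), ..., g_m(x)) for the construction above."""
    n = len(x)
    inner, bits_per_block = make_inner_functions(n, k)
    return combine([g(x) for g in inner], bits_per_block, n)
```

For example, `composed_majority((1, 0, 1, 1, 0), 2)` evaluates majority on five inputs using three blocks with two output bits each, so six inner functions in total.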
Submitted 16 May, 2022; v1 submitted 4 May, 2022;
originally announced May 2022.
-
Sharper bounds on the Fourier concentration of DNFs
Authors:
Victor Lecomte,
Li-Yang Tan
Abstract:
In 1992 Mansour proved that every size-$s$ DNF formula is Fourier-concentrated on $s^{O(\log\log s)}$ coefficients. We improve this to $s^{O(\log\log k)}$ where $k$ is the read number of the DNF. Since $k$ is always at most $s$, our bound matches Mansour's for all DNFs and strengthens it for small-read ones. The previous best bound for read-$k$ DNFs was $s^{O(k^{3/2})}$. For $k$ up to $\tildeΘ(\log\log s)$, we further improve our bound to the optimal $\mathrm{poly}(s)$; previously no such bound was known for any $k = ω_s(1)$.
Our techniques involve new connections between the term structure of a DNF, viewed as a set system, and its Fourier spectrum.
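To make the notion of Fourier concentration concrete, the following brute-force Python sketch computes the full Fourier spectrum of a tiny DNF and counts how many coefficients are needed to capture a $1 - \varepsilon$ fraction of the Fourier weight. It is only an illustration for very small $n$, not the paper's technique; the term encoding (a dict from variable index to required bit) and the function names are assumptions made for this example, as is the choice to encode true as $-1$ and false as $+1$.

```python
from itertools import product


def dnf_value(terms, x):
    """Evaluate a DNF; each term is a dict {variable index: required bit}."""
    return any(all(x[i] == bit for i, bit in term.items()) for term in terms)


def fourier_coefficients(terms, n):
    """Return {S: coefficient}, with S an indicator tuple over the n variables.

    f is encoded as -1 (true) / +1 (false), so the squared coefficients
    sum to 1 by Parseval's identity.
    """
    points = list(product((0, 1), repeat=n))
    values = [-1 if dnf_value(terms, x) else 1 for x in points]
    coeffs = {}
    for S in product((0, 1), repeat=n):
        total = 0
        for x, fx in zip(points, values):
            parity = sum(s & xi for s, xi in zip(S, x)) % 2
            total += fx * (-1 if parity else 1)  # f(x) times the character chi_S(x)
        coeffs[S] = total / len(points)
    return coeffs


def coefficients_needed(coeffs, eps):
    """Smallest number of coefficients whose squared sum is at least 1 - eps."""
    weights = sorted((c * c for c in coeffs.values()), reverse=True)
    running = 0.0
    for count, w in enumerate(weights, start=1):
        running += w
        if running >= 1 - eps:
            return count
    return len(weights)
```

Calling `coefficients_needed(fourier_coefficients(terms, n), eps)` on a small DNF gives the exact sparsity needed at that accuracy; the running time grows as $4^n$, so this is only practical for a handful of variables.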
Submitted 15 October, 2021; v1 submitted 9 September, 2021;
originally announced September 2021.
-
The power of adaptivity in source identification with time queries on the path
Authors:
Victor Lecomte,
Gergely Ódor,
Patrick Thiran
Abstract:
We study the problem of identifying the source of a stochastic diffusion process spreading on a graph based on the arrival times of the diffusion at a few queried nodes. In a graph $G=(V,E)$, an unknown source node $v^* \in V$ is drawn uniformly at random, and unknown edge weights $w(e)$ for $e\in E$, representing the propagation delays along the edges, are drawn independently from a Gaussian distribution of mean $1$ and variance $σ^2$. An algorithm then attempts to identify $v^*$ by querying nodes $q \in V$ and being told the length of the shortest path between $q$ and $v^*$ in graph $G$ weighted by $w$. We consider two settings: non-adaptive, in which all query nodes must be decided in advance, and adaptive, in which each query can depend on the results of the previous ones. Both settings are motivated by an application of the problem to epidemic processes (where the source is called patient zero), which we discuss in detail.
We characterize the query complexity when $G$ is an $n$-node path. In the non-adaptive setting, $Θ(nσ^2)$ queries are needed for $σ^2 \leq 1$, and $Θ(n)$ for $σ^2 \geq 1$. In the adaptive setting, somewhat surprisingly, only $Θ(\log\log_{1/σ}n)$ are needed when $σ^2 \leq 1/2$, and $Θ(\log \log n)+O_σ(1)$ when $σ^2 \geq 1/2$. This is the first mathematical study of source identification with time queries in a non-deterministic diffusion process.
Submitted 29 December, 2021; v1 submitted 17 February, 2020;
originally announced February 2020.
-
Settling the relationship between Wilber's bounds for dynamic optimality
Authors:
Victor Lecomte,
Omri Weinstein
Abstract:
In FOCS 1986, Wilber proposed two combinatorial lower bounds on the operational cost of any binary search tree (BST) for a given access sequence $X \in [n]^m$. Both bounds play a central role in the ongoing pursuit of the dynamic optimality conjecture (Sleator and Tarjan, 1985), but their relationship remained unknown for more than three decades. We show that Wilber's Funnel bound dominates his Alternation bound for all $X$, and give a tight $Θ(\lg\lg n)$ separation for some $X$, answering Wilber's conjecture and an open problem of Iacono, Demaine et al. The main ingredient of the proof is a new "symmetric" characterization of Wilber's Funnel bound, which proves that it is invariant under rotations of $X$. We use this characterization to provide initial indication that the Funnel bound matches the Independent Rectangle bound (Demaine et al., 2009), by proving that when the Funnel bound is constant, $\mathsf{IRB}_{\diagup\hspace{-.6em}\square}$ is linear. To the best of our knowledge, our results provide the first progress on Wilber's conjecture that the Funnel bound is dynamically optimal (1986).
Submitted 28 June, 2020; v1 submitted 5 December, 2019;
originally announced December 2019.