-
Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers
Authors:
Yibo Jiang,
Goutham Rajendran,
Pradeep Ravikumar,
Bryon Aragam
Abstract:
Large Language Models (LLMs) have the capacity to store and recall facts. Through experimentation with open-source models, we observe that this ability to retrieve facts can be easily manipulated by changing contexts, even without altering their factual meanings. These findings highlight that LLMs might behave like an associative memory model where certain tokens in the contexts serve as clues to retrieving facts. We mathematically explore this property by studying how transformers, the building blocks of LLMs, can complete such memory tasks. We study a simple latent concept association problem with a one-layer transformer and we show theoretically and empirically that the transformer gathers information using self-attention and uses the value matrix for associative memory.
Submitted 26 June, 2024;
originally announced June 2024.
-
Efficient Certificates of Anti-Concentration Beyond Gaussians
Authors:
Ainesh Bakshi,
Pravesh Kothari,
Goutham Rajendran,
Madhur Tulsiani,
Aravindan Vijayaraghavan
Abstract:
A set of high dimensional points $X=\{x_1, x_2,\ldots, x_n\} \subset R^d$ in isotropic position is said to be $δ$-anti concentrated if for every direction $v$, the fraction of points in $X$ satisfying $|\langle x_i,v \rangle |\leq δ$ is at most $O(δ)$. Motivated by applications to list-decodable learning and clustering, recent works have considered the problem of constructing efficient certificates of anti-concentration in the average case, when the set of points $X$ corresponds to samples from a Gaussian distribution. Their certificates played a crucial role in several subsequent works in algorithmic robust statistics on list-decodable learning and settling the robust learnability of arbitrary Gaussian mixtures, yet remain limited to rotationally invariant distributions.
This work presents a new (and arguably the most natural) formulation for anti-concentration. Using this formulation, we give quasi-polynomial time verifiable sum-of-squares certificates of anti-concentration that hold for a wide class of non-Gaussian distributions including anti-concentrated bounded product distributions and uniform distributions over $L_p$ balls (and their affine transformations). Consequently, our method upgrades and extends results in algorithmic robust statistics, e.g., list-decodable learning and clustering, to such distributions. Our approach constructs a canonical integer program for anti-concentration and analyzes a sum-of-squares relaxation of it, independent of the intended application. We rely on duality and analyze a pseudo-expectation on large subsets of the input points that take a small value in some direction. Our analysis uses the method of polynomial reweightings to reduce the problem to analyzing only analytically dense or sparse directions.
Submitted 23 May, 2024;
originally announced May 2024.
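The $δ$-anti-concentration condition defined in the abstract above can be probed empirically for one direction at a time. The following is a minimal sketch, not the paper's certificate: it only measures the fraction of points with $|\langle x_i, v \rangle| \leq δ$ for a single chosen $v$, whereas the definition quantifies over every direction. The function name `slab_fraction` and the Gaussian test data are assumptions made for illustration.

```python
import math
import random


def slab_fraction(points, direction, delta):
    """Fraction of points x with |<x, v>| <= delta, where v is the
    unit vector along `direction` (hypothetical helper, not from the paper)."""
    norm = math.sqrt(sum(c * c for c in direction))
    unit = [c / norm for c in direction]
    inside = 0
    for x in points:
        projection = sum(a * b for a, b in zip(x, unit))
        if abs(projection) <= delta:
            inside += 1
    return inside / len(points)


# Illustrative data: standard Gaussian samples in R^3 (isotropic in expectation).
rng = random.Random(0)
points = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(20000)]

# For a Gaussian the slab mass is roughly 0.8 * delta, i.e. O(delta).
fraction = slab_fraction(points, [1.0, 1.0, 0.0], 0.1)
print(f"fraction within the slab: {fraction:.4f}")
```

Checking one direction is cheap; the difficulty the abstract refers to is certifying the bound for all directions simultaneously, which this sketch does not attempt.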
-
On the Origins of Linear Representations in Large Language Models
Authors:
Yibo Jiang,
Goutham Rajendran,
Pradeep Ravikumar,
Bryon Aragam,
Victor Veitch
Abstract:
Recent works have argued that high-level semantic concepts are encoded "linearly" in the representation space of large language models. In this work, we study the origins of such linear representations. To that end, we introduce a simple latent variable model to abstract and formalize the concept dynamics of the next token prediction. We use this formalism to show that the next token prediction objective (softmax with cross-entropy) and the implicit bias of gradient descent together promote the linear representation of concepts. Experiments show that linear representations emerge when learning from data matching the latent variable model, confirming that this simple structure already suffices to yield linear representations. We additionally confirm some predictions of the theory using the LLaMA-2 large language model, giving evidence that the simplified model yields generalizable insights.
Submitted 6 March, 2024;
originally announced March 2024.
-
Learning Interpretable Concepts: Unifying Causal Representation Learning and Foundation Models
Authors:
Goutham Rajendran,
Simon Buchholz,
Bryon Aragam,
Bernhard Schölkopf,
Pradeep Ravikumar
Abstract:
To build intelligent machine learning systems, there are two broad approaches. One approach is to build inherently interpretable models, as endeavored by the growing field of causal representation learning. The other approach is to build highly-performant foundation models and then invest efforts into understanding how they work. In this work, we relate these two approaches and study how to learn human-interpretable concepts from data. Weaving together ideas from both fields, we formally define a notion of concepts and show that they can be provably recovered from diverse data. Experiments on synthetic data and large language models show the utility of our unified approach.
Submitted 14 February, 2024;
originally announced February 2024.
-
An Interventional Perspective on Identifiability in Gaussian LTI Systems with Independent Component Analysis
Authors:
Goutham Rajendran,
Patrik Reizinger,
Wieland Brendel,
Pradeep Ravikumar
Abstract:
We investigate the relationship between system identification and intervention design in dynamical systems. While previous research demonstrated how identifiable representation learning methods, such as Independent Component Analysis (ICA), can reveal cause-effect relationships, it relied on a passive perspective without considering how to collect data. Our work shows that in Gaussian Linear Time-Invariant (LTI) systems, the system parameters can be identified by introducing diverse intervention signals in a multi-environment setting. By harnessing appropriate diversity assumptions motivated by the ICA literature, our findings connect experiment design and representational identifiability in dynamical systems. We corroborate our findings on synthetic and (simulated) physical data. Additionally, we show that Hidden Markov Models, in general, and (Gaussian) LTI systems, in particular, fulfil a generalization of the Causal de Finetti theorem with continuous parameters.
Submitted 16 February, 2024; v1 submitted 29 November, 2023;
originally announced November 2023.
-
Performance Evaluation of Video Streaming Applications with Target Wake Time in Wi-Fi 6
Authors:
Govind Rajendran,
Rishabh Roy,
Preyas Hathi,
Nadeem Akhtar,
Samar Agnihotri
Abstract:
The Target Wake Time (TWT) feature, introduced in Wi-Fi 6, was primarily meant as an advanced power save mechanism. However, it has some interesting applications in scheduling and resource allocation. TWT-based resource allocation can be used to improve the user experience for certain applications, e.g., VoIP, IoT, video streaming, etc. In this work, we analyze the packet arrival pattern for streaming traffic and develop a synthetic video streaming traffic generator that mimics real-world streaming traffic. We propose a two-stage approach where we calculate the TWT duty cycle in the first step. In the subsequent step, we determine the Multiplication Factor (MF), which jointly dictates the required TWT schedule for the synthetic traffic model. Initial testing shows that key QoS metrics can be met for sustained performance of synthetic traffic upon enabling TWT, even in the presence of peak background congestion in the network.
Submitted 4 October, 2023;
originally announced October 2023.
-
Learning Linear Causal Representations from Interventions under General Nonlinear Mixing
Authors:
Simon Buchholz,
Goutham Rajendran,
Elan Rosenfeld,
Bryon Aragam,
Bernhard Schölkopf,
Pradeep Ravikumar
Abstract:
We study the problem of learning causal representations from unknown, latent interventions in a general setting, where the latent distribution is Gaussian but the mixing function is completely general. We prove strong identifiability results given unknown single-node interventions, i.e., without having access to the intervention targets. This generalizes prior works which have focused on weaker classes, such as linear maps or paired counterfactual data. This is also the first instance of causal identifiability from non-paired interventions for deep neural network embeddings. Our proof relies on carefully uncovering the high-dimensional geometric structure present in the data distribution after a non-linear density transformation, which we capture by analyzing quadratic forms of precision matrices of the latent distributions. Finally, we propose a contrastive algorithm to identify the latent variables in practice and evaluate its performance on various tasks.
Submitted 18 December, 2023; v1 submitted 3 June, 2023;
originally announced June 2023.
-
Integrated Architecture for Neural Networks and Security Primitives using RRAM Crossbar
Authors:
Simranjeet Singh,
Furqan Zahoor,
Gokulnath Rajendran,
Vikas Rana,
Sachin Patkar,
Anupam Chattopadhyay,
Farhad Merchant
Abstract:
This paper proposes an architecture that integrates neural networks (NNs) and hardware security modules using a single resistive random access memory (RRAM) crossbar. The proposed architecture enables using a single crossbar to implement NN, true random number generator (TRNG), and physical unclonable function (PUF) applications while exploiting the multi-state storage characteristic of the RRAM crossbar for the vector-matrix multiplication operation required for the implementation of NN. The TRNG is implemented by utilizing the crossbar's variation in device switching thresholds to generate random bits. The PUF is implemented using the same crossbar initialized as an entropy source for the TRNG. Additionally, the weights locking concept is introduced to enhance the security of NNs by preventing unauthorized access to the NN weights. The proposed architecture provides flexibility to configure the RRAM device in multiple modes to suit different applications. It shows promise in achieving a more efficient and compact design for the hardware implementation of NNs and security primitives.
Submitted 1 May, 2023; v1 submitted 26 April, 2023;
originally announced April 2023.
-
Sum-of-Squares Lower Bounds for Densest $k$-Subgraph
Authors:
Chris Jones,
Aaron Potechin,
Goutham Rajendran,
Jeff Xu
Abstract:
Given a graph and an integer $k$, Densest $k$-Subgraph is the algorithmic task of finding the subgraph on $k$ vertices with the maximum number of edges. This is a fundamental problem that has been subject to intense study for decades, with applications spanning a wide variety of fields. The state-of-the-art algorithm is an $O(n^{1/4 + ε})$-factor approximation (for any $ε> 0$) due to Bhaskara et al. [STOC '10]. Moreover, the so-called log-density framework predicts that this is optimal, i.e. it is impossible for an efficient algorithm to achieve an $O(n^{1/4 - ε})$-factor approximation. In the average case, Densest $k$-Subgraph is a prototypical noisy inference task which is conjectured to exhibit a statistical-computational gap.
In this work, we provide the strongest evidence yet of hardness for Densest $k$-Subgraph by showing matching lower bounds against the powerful Sum-of-Squares (SoS) algorithm, a meta-algorithm based on convex programming that achieves state-of-the-art algorithmic guarantees for many optimization and inference problems. For $k \leq n^{\frac{1}{2}}$, we obtain a degree $n^δ$ SoS lower bound for the hard regime as predicted by the log-density framework.
To show this, we utilize the modern framework for proving SoS lower bounds on average-case problems pioneered by Barak et al. [FOCS '16]. A key issue is that small denser-than-average subgraphs in the input will greatly affect the value of the candidate pseudoexpectation operator around the subgraph. To handle this challenge, we devise a novel matrix factorization scheme based on the positive minimum vertex separator. We then prove an intersection tradeoff lemma to show that the error terms when using this separator are indeed small.
Submitted 30 March, 2023;
originally announced March 2023.
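To make the problem statement in the abstract above concrete, here is a minimal brute-force sketch of Densest $k$-Subgraph. It enumerates every $k$-vertex subset and counts induced edges, so it takes time exponential in $k$ and is only meant to pin down the objective; it is unrelated to the approximation algorithm or the SoS lower bound discussed in the paper. The function name and the toy graph are assumptions made for illustration.

```python
from itertools import combinations


def densest_k_subgraph(num_vertices, edges, k):
    """Return (vertex subset, induced edge count) maximizing the number of
    edges among all k-vertex subsets. Exhaustive search, illustration only."""
    edge_set = {frozenset(edge) for edge in edges}
    best_vertices, best_count = None, -1
    for subset in combinations(range(num_vertices), k):
        # Count the edges of the graph with both endpoints inside the subset.
        count = sum(
            1 for pair in combinations(subset, 2) if frozenset(pair) in edge_set
        )
        if count > best_count:
            best_vertices, best_count = subset, count
    return best_vertices, best_count


# Toy graph: a triangle on {0, 1, 2} with a path 2-3-4 attached.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
vertices, count = densest_k_subgraph(5, edges, 3)
print(vertices, count)  # (0, 1, 2) 3
```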
-
Nonlinear Random Matrices and Applications to the Sum of Squares Hierarchy
Authors:
Goutham Rajendran
Abstract:
We develop new tools in the theory of nonlinear random matrices and apply them to study the performance of the Sum of Squares (SoS) hierarchy on average-case problems.
The SoS hierarchy is a powerful optimization technique that has achieved tremendous success for various problems in combinatorial optimization, robust statistics and machine learning. It's a family of convex relaxations that lets us smoothly trade off running time for approximation guarantees. In recent works, it's been shown to be extremely useful for recovering structure in high dimensional noisy data. It also remains our best approach towards refuting the notorious Unique Games Conjecture.
In this work, we analyze the performance of the SoS hierarchy on fundamental problems stemming from statistics, theoretical computer science and statistical physics. In particular, we show subexponential-time SoS lower bounds for the problems of the Sherrington-Kirkpatrick Hamiltonian, Planted Slightly Denser Subgraph, Tensor Principal Components Analysis and Sparse Principal Components Analysis. These SoS lower bounds involve analyzing large random matrices, wherein lie our main contributions. These results offer strong evidence for the truth of and insight into the low-degree likelihood ratio hypothesis, an important conjecture that predicts the power of bounded-time algorithms for hypothesis testing.
We also develop general-purpose tools for analyzing the behavior of random matrices which are functions of independent random variables. Towards this, we build on and generalize the matrix variant of the Efron-Stein inequalities. In particular, our general theorem on matrix concentration recovers various results that have appeared in the literature. We expect these random matrix theory ideas to have other significant applications.
Submitted 9 February, 2023;
originally announced February 2023.
-
Hardware Security Primitives using Passive RRAM Crossbar Array: Novel TRNG and PUF Designs
Authors:
Simranjeet Singh,
Furqan Zahoor,
Gokulnath Rajendran,
Sachin Patkar,
Anupam Chattopadhyay,
Farhad Merchant
Abstract:
With rapid advancements in electronic gadgets, the security and privacy aspects of these devices are significant. For the design of secure systems, physical unclonable function (PUF) and true random number generator (TRNG) are critical hardware security primitives for security applications. This paper proposes novel implementations of PUF and TRNGs on the RRAM crossbar structure. Firstly, two techniques to implement the TRNG in the RRAM crossbar are presented based on write-back and 50% switching probability pulse. The randomness of the proposed TRNGs is evaluated using the NIST test suite. Next, an architecture to implement the PUF in the RRAM crossbar is presented. The initial entropy source for the PUF is used from TRNGs, and challenge-response pairs (CRPs) are collected. The proposed PUF exploits the device variations and sneak-path current to produce unique CRPs. We demonstrate, through extensive experiments, reliability of 100%, uniqueness of 47.78%, uniformity of 49.79%, and bit-aliasing of 48.57% without any post-processing techniques. Finally, the design is compared with the literature to evaluate its implementation efficiency, which is clearly found to be superior to the state-of-the-art.
Submitted 7 November, 2022;
originally announced November 2022.
-
Concentration of polynomial random matrices via Efron-Stein inequalities
Authors:
Goutham Rajendran,
Madhur Tulsiani
Abstract:
Analyzing concentration of large random matrices is a common task in a wide variety of fields. Given independent random variables, many tools are available to analyze random matrices whose entries are linear in the variables, e.g. the matrix-Bernstein inequality. However, in many applications, we need to analyze random matrices whose entries are polynomials in the variables. These arise naturally in the analysis of spectral algorithms, e.g., Hopkins et al. [STOC 2016], Moitra-Wein [STOC 2019]; and in lower bounds for semidefinite programs based on the Sum of Squares hierarchy, e.g. Barak et al. [FOCS 2016], Jones et al. [FOCS 2021]. In this work, we present a general framework to obtain such bounds, based on the matrix Efron-Stein inequalities developed by Paulin-Mackey-Tropp [Annals of Probability 2016]. The Efron-Stein inequality bounds the norm of a random matrix by the norm of another simpler (but still random) matrix, which we view as arising by "differentiating" the starting matrix. By recursively differentiating, our framework reduces the main task to analyzing far simpler matrices. For Rademacher variables, these simpler matrices are in fact deterministic and hence, analyzing them is far easier. For general non-Rademacher variables, the task reduces to scalar concentration, which is much easier. Moreover, in the setting of polynomial matrices, our results generalize the work of Paulin-Mackey-Tropp. Using our basic framework, we recover known bounds in the literature for simple "tensor networks" and "dense graph matrices". Using our general framework, we derive bounds for "sparse graph matrices", which were obtained only recently by Jones et al. [FOCS 2021] using a nontrivial application of the trace power method, and were a core component in their work. We expect our framework to be helpful for other applications involving concentration phenomena for nonlinear random matrices.
Submitted 17 January, 2023; v1 submitted 6 September, 2022;
originally announced September 2022.
-
Analyzing Robustness of End-to-End Neural Models for Automatic Speech Recognition
Authors:
Goutham Rajendran,
Wei Zou
Abstract:
We investigate robustness properties of pre-trained neural models for automatic speech recognition. Real life data in machine learning is usually very noisy and almost never clean, which can be attributed to various factors depending on the domain, e.g. outliers, random noise and adversarial noise. Therefore, the models we develop for various tasks should be robust to such kinds of noisy data, which led to the thriving field of robust machine learning. We consider this important issue in the setting of automatic speech recognition. With the increasing popularity of pre-trained models, it's an important question to analyze and understand the robustness of such models to noise. In this work, we perform a robustness analysis of the pre-trained neural models wav2vec2, HuBERT and DistilHuBERT on the LibriSpeech and TIMIT datasets. We use different kinds of noising mechanisms and measure the model performances as quantified by the inference time and the standard Word Error Rate metric. We also do an in-depth layer-wise analysis of the wav2vec2 model when injecting noise in between layers, enabling us to predict at a high level what each layer learns. Finally, for this model, we visualize the propagation of errors across the layers and compare how it behaves on clean versus noisy data. Our experiments confirm the predictions of Pasad et al. [2021] and also raise interesting directions for future work.
Submitted 17 August, 2022;
originally announced August 2022.
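The Word Error Rate mentioned in the abstract above is conventionally the word-level edit distance (substitutions, deletions, insertions) between a hypothesis and a reference transcript, divided by the number of reference words. The following is a minimal sketch under that standard definition; the paper's exact evaluation code and text normalization are not given in the abstract, so whitespace tokenization and the function name are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.
    Tokenization by whitespace is an assumption for this sketch."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] = edit distance between the processed reference prefix
    # and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion of a reference word
                    current[j - 1] + 1,  # insertion of a hypothesis word
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1] / len(ref_words)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors over 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {wer:.3f}")
```

Because insertions count as errors, the value can exceed 1 when the hypothesis is much longer than the reference.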
-
Combinatorial Optimization via the Sum of Squares Hierarchy
Authors:
Goutham Rajendran
Abstract:
We study the Sum of Squares (SoS) Hierarchy with a view towards combinatorial optimization. We survey the use of the SoS hierarchy to obtain approximation algorithms on graphs using their spectral properties. We present a simplified proof of the result of Feige and Krauthgamer on the performance of the hierarchy for the Maximum Clique problem on random graphs. We also present a result of Guruswami and Sinop that shows how to obtain approximation algorithms for the Minimum Bisection problem on low threshold-rank graphs.
We study inapproximability results for the SoS hierarchy for general constraint satisfaction problems and problems involving graph densities such as the Densest $k$-subgraph problem. We improve the existing inapproximability results for general constraint satisfaction problems in the case of large arity, using stronger probabilistic analyses of expansion of random instances. We examine connections between constraint satisfaction problems and density problems on graphs. Using them, we obtain new inapproximability results for the hierarchy for the Densest $k$-subhypergraph problem and the Minimum $p$-Union problem, which are proven via reductions.
We also illustrate the relatively new idea of pseudocalibration to construct integrality gaps for the SoS hierarchy for Maximum Clique and Max $K$-CSP. The application to Max $K$-CSP that we present is known in the community but has not been presented before in the literature, to the best of our knowledge.
Submitted 1 September, 2022; v1 submitted 8 August, 2022;
originally announced August 2022.
-
Identifiability of deep generative models without auxiliary information
Authors:
Bohdan Kivva,
Goutham Rajendran,
Pradeep Ravikumar,
Bryon Aragam
Abstract:
We prove identifiability of a broad class of deep latent variable models that (a) have universal approximation capabilities and (b) are the decoders of variational autoencoders that are commonly used in practice. Unlike existing work, our analysis does not require weak supervision, auxiliary information, or conditioning in the latent space. Specifically, we show that for a broad class of generative (i.e. unsupervised) models with universal approximation capabilities, the side information $u$ is not necessary: We prove identifiability of the entire generative model where we do not observe $u$ and only observe the data $x$. The models we consider match autoencoder architectures used in practice that leverage mixture priors in the latent space and ReLU/leaky-ReLU activations in the encoder, such as VaDE and MFC-VAE. Our main result is an identifiability hierarchy that significantly generalizes previous work and exposes how different assumptions lead to different "strengths" of identifiability, and includes certain "vanilla" VAEs with isotropic Gaussian priors as a special case. For example, our weakest result establishes (unsupervised) identifiability up to an affine transformation, and thus partially resolves an open problem regarding model identifiability raised in prior work. These theoretical results are augmented with experiments on both simulated and real data.
Submitted 18 October, 2022; v1 submitted 20 June, 2022;
originally announced June 2022.
-
Sum-of-Squares Lower Bounds for Sparse Independent Set
Authors:
Chris Jones,
Aaron Potechin,
Goutham Rajendran,
Madhur Tulsiani,
Jeff Xu
Abstract:
The Sum-of-Squares (SoS) hierarchy of semidefinite programs is a powerful algorithmic paradigm which captures state-of-the-art algorithmic guarantees for a wide array of problems. In the average case setting, SoS lower bounds provide strong evidence of algorithmic hardness or information-computation gaps. Prior to this work, SoS lower bounds have been obtained for problems in the "dense" input regime, where the input is a collection of independent Rademacher or Gaussian random variables, while the sparse regime has remained out of reach. We make the first progress in this direction by obtaining strong SoS lower bounds for the problem of Independent Set on sparse random graphs. We prove that with high probability over an Erdos-Renyi random graph $G\sim G_{n,\frac{d}{n}}$ with average degree $d>\log^2 n$, degree-$D_{SoS}$ SoS fails to refute the existence of an independent set of size $k = Ω\left(\frac{n}{\sqrt{d}(\log n)(D_{SoS})^{c_0}} \right)$ in $G$ (where $c_0$ is an absolute constant), whereas the true size of the largest independent set in $G$ is $O\left(\frac{n\log d}{d}\right)$.
Our proof involves several significant extensions of the techniques used for proving SoS lower bounds in the dense setting. Previous lower bounds are based on the pseudo-calibration heuristic of Barak et al [FOCS 2016] which produces a candidate SoS solution using a planted distribution indistinguishable from the input distribution via low-degree tests. In the sparse case the natural planted distribution does admit low-degree distinguishers, and we show how to adapt the pseudo-calibration heuristic to overcome this.
Another notorious technical challenge for the sparse regime is the quest for matrix norm bounds. In this paper, we obtain new norm bounds for graph matrices in the sparse setting.
Submitted 17 November, 2021;
originally announced November 2021.
-
Structure learning in polynomial time: Greedy algorithms, Bregman information, and exponential families
Authors:
Goutham Rajendran,
Bohdan Kivva,
Ming Gao,
Bryon Aragam
Abstract:
Greedy algorithms have long been a workhorse for learning graphical models, and more broadly for learning statistical models with sparse structure. In the context of learning directed acyclic graphs, greedy algorithms are popular despite their worst-case exponential runtime. In practice, however, they are very efficient. We provide new insight into this phenomenon by studying a general greedy score-based algorithm for learning DAGs. Unlike edge-greedy algorithms such as the popular GES and hill-climbing algorithms, our approach is vertex-greedy and requires at most a polynomial number of score evaluations. We then show how recent polynomial-time algorithms for learning DAG models are a special case of this algorithm, thereby illustrating how these order-based algorithms can be rigorously interpreted as score-based algorithms. This observation suggests new score functions and optimality conditions based on the duality between Bregman divergences and exponential families, which we explore in detail. Explicit sample and computational complexity bounds are derived. Finally, we provide extensive experiments suggesting that this algorithm indeed optimizes the score in a variety of settings.
Submitted 28 October, 2021; v1 submitted 10 October, 2021;
originally announced October 2021.
-
Learning latent causal graphs via mixture oracles
Authors:
Bohdan Kivva,
Goutham Rajendran,
Pradeep Ravikumar,
Bryon Aragam
Abstract:
We study the problem of reconstructing a causal graphical model from data in the presence of latent variables. The main problem of interest is recovering the causal structure over the latent variables while allowing for general, potentially nonlinear dependence between the variables. In many practical problems, the dependence between raw observations (e.g. pixels in an image) is much less relevant than the dependence between certain high-level, latent features (e.g. concepts or objects), and this is the setting of interest. We provide conditions under which both the latent representations and the underlying latent causal model are identifiable by a reduction to a mixture oracle. These results highlight an intriguing connection between the well-studied problem of learning the order of a mixture model and the problem of learning the bipartite structure between observables and unobservables. The proof is constructive, and leads to several algorithms for explicitly reconstructing the full graphical model. We discuss efficient algorithms and provide experiments illustrating the algorithms in practice.
Submitted 21 November, 2021; v1 submitted 29 June, 2021;
originally announced June 2021.
-
Machinery for Proving Sum-of-Squares Lower Bounds on Certification Problems
Authors:
Aaron Potechin,
Goutham Rajendran
Abstract:
In this paper, we construct general machinery for proving Sum-of-Squares lower bounds on certification problems by generalizing the techniques used by Barak et al. [FOCS 2016] to prove Sum-of-Squares lower bounds for planted clique. Using this machinery, we prove degree $n^ε$ Sum-of-Squares lower bounds for tensor PCA, the Wishart model of sparse PCA, and a variant of planted clique which we call planted slightly denser subgraph.
Submitted 9 February, 2023; v1 submitted 9 November, 2020;
originally announced November 2020.
-
Sum-of-Squares Lower Bounds for Sherrington-Kirkpatrick via Planted Affine Planes
Authors:
Mrinalkanti Ghosh,
Fernando Granha Jeronimo,
Chris Jones,
Aaron Potechin,
Goutham Rajendran
Abstract:
The Sum-of-Squares (SoS) hierarchy is a semi-definite programming meta-algorithm that captures state-of-the-art polynomial time guarantees for many optimization problems such as Max-$k$-CSPs and Tensor PCA. On the flip side, a SoS lower bound provides evidence of hardness, which is particularly relevant to average-case problems for which NP-hardness may not be available.
In this paper, we consider the following average case problem, which we call the \emph{Planted Affine Planes} (PAP) problem: Given $m$ random vectors $d_1,\ldots,d_m$ in $\mathbb{R}^n$, can we prove that there is no vector $v \in \mathbb{R}^n$ such that for all $u \in [m]$, $\langle v, d_u\rangle^2 = 1$? In other words, can we prove that $m$ random vectors are not all contained in two parallel hyperplanes at equal distance from the origin? We prove that for $m \leq n^{3/2-ε}$, with high probability, degree-$n^{Ω(ε)}$ SoS fails to refute the existence of such a vector $v$.
When the vectors $d_1,\ldots,d_m$ are chosen from the multivariate normal distribution, the PAP problem is equivalent to the problem of proving that a random $n$-dimensional subspace of $\mathbb{R}^m$ does not contain a boolean vector. As shown by Mohanty--Raghavendra--Xu [STOC 2020], a lower bound for this problem implies a lower bound for the problem of certifying energy upper bounds on the Sherrington-Kirkpatrick Hamiltonian, and so our lower bound implies a degree-$n^{Ω(ε)}$ SoS lower bound for the certification version of the Sherrington-Kirkpatrick problem.
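To make the PAP condition concrete, the following is a minimal sketch that measures how far a candidate vector $v$ is from satisfying $\langle v, d_u\rangle^2 = 1$ for every $u$. The function names `pap_residual` and `sample_gaussian_vectors` are hypothetical, and the standard normal sampling is an assumption about normalization, not taken from the paper. The sketch only checks a given candidate; it does not certify that no such $v$ exists, which is the task the abstract shows degree-$n^{Ω(ε)}$ SoS fails at.

```python
import random


def pap_residual(v, vectors):
    """Largest deviation of <v, d_u>^2 from 1 over all vectors d_u.

    A residual of 0 means v witnesses the Planted Affine Planes condition:
    every d_u lies on one of the two hyperplanes <v, x> = +1 or <v, x> = -1.
    """
    return max(
        abs(sum(vi * di for vi, di in zip(v, d)) ** 2 - 1.0)
        for d in vectors
    )


def sample_gaussian_vectors(m, n, seed=0):
    """Draw m vectors in R^n with i.i.d. standard normal entries.

    Assumption: this normalization is illustrative only.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]


# A hand-built "planted" instance: the first coordinate of each vector is +1 or -1,
# so v = e_1 satisfies the condition exactly.
planted = [[1, 5], [-1, 2], [1, -3]]
planted_residual = pap_residual([1, 0], planted)

# For random Gaussian vectors, a fixed candidate such as e_1 generally does not
# satisfy the condition.
random_vectors = sample_gaussian_vectors(m=6, n=4)
random_residual = pap_residual([1, 0, 0, 0], random_vectors)
```

Checking one candidate takes $O(mn)$ time; the hardness discussed in the abstract concerns refuting all candidates $v$ simultaneously.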
Submitted 3 September, 2020;
originally announced September 2020.