doi 10.1109/TSP.2022.3156702

Self-Regularity of Non-Negative Output Weights for Overparameterized Two-Layer Neural Networks

Authors: David Gamarnik, Eren C. Kızıldağ, Ilias Zadik

Abstract: We consider the problem of finding a two-layer neural network with sigmoid, rectified linear unit (ReLU), or binary step activation functions that "fits" a training data set as accurately as possible as quantified by the training error; and study the following question: \emph{does a low training error guarantee that the norm of the output layer (outer norm) itself is small?} We answer this question affirmatively for the case of non-negative output weights. Using a simple covering number argument, we establish that under quite mild distributional assumptions on the input/label pairs, any such network achieving a small training error on polynomially many data necessarily has a well-controlled outer norm. Notably, our results (a) have a polynomial (in $d$) sample complexity, (b) are independent of the number of hidden units (which can potentially be very high), (c) are oblivious to the training algorithm; and (d) require quite mild assumptions on the data (in particular the input vector $X\in\mathbb{R}^d$ need not have independent coordinates). We then leverage our bounds to establish generalization guarantees for such networks through \emph{fat-shattering dimension}, a scale-sensitive measure of the complexity class that the network architectures we investigate belong to. Notably, our generalization bounds also have good sample complexity (polynomials in $d$ with a low degree), and are in fact near-linear for some important cases of interest.

Submitted 2 March, 2021; originally announced March 2021.

Comments: 34 pages. Some of the results in the present paper are significantly strengthened versions of certain results appearing in arXiv:2003.10523

arXiv:2004.12063 [pdf, ps, other]

Hardness of Random Optimization Problems for Boolean Circuits, Low-Degree Polynomials, and Langevin Dynamics

Authors: David Gamarnik, Aukosh Jagannath, Alexander S. Wein

Abstract: We consider the problem of finding nearly optimal solutions of optimization problems with random objective functions. Two concrete problems we consider are (a) optimizing the Hamiltonian of a spherical or Ising $p$-spin glass model, and (b) finding a large independent set in a sparse Erdős-Rényi graph. The following families of algorithms are considered: (a) low-degree polynomials of the input; (b) low-depth Boolean circuits; (c) the Langevin dynamics algorithm. We show that these families of algorithms fail to produce nearly optimal solutions with high probability. For the case of Boolean circuits, our results improve the state-of-the-art bounds known in circuit complexity theory (although we consider the search problem as opposed to the decision problem). Our proof uses the fact that these models are known to exhibit a variant of the overlap gap property (OGP) of near-optimal solutions. Specifically, for both models, every two solutions whose objectives are above a certain threshold are either close or far from each other. The crux of our proof is that the classes of algorithms we consider exhibit a form of stability. We show by an interpolation argument that stable algorithms cannot overcome the OGP barrier. The stability of Langevin dynamics is an immediate consequence of the well-posedness of stochastic differential equations. The stability of low-degree polynomials and Boolean circuits is established using tools from Gaussian and Boolean analysis -- namely hypercontractivity and total influence, as well as a novel lower bound for random walks avoiding certain subsets. In the case of Boolean circuits, the result also makes use of Linial-Mansour-Nisan's classical theorem. Our techniques apply more broadly to low influence functions and may apply more generally.

Submitted 26 January, 2022; v1 submitted 25 April, 2020; originally announced April 2020.

Comments: 41 pages; v1 is the conference paper "Low-Degree Hardness of Random Optimization Problems" (FOCS 2020); v2 is a journal version which adds circuit lower bounds for max independent set, based on ideas from our note arXiv:2109.01342

arXiv:2003.10523 [pdf, other]

Neural Networks and Polynomial Regression. Demystifying the Overparametrization Phenomena

Authors: Matt Emschwiller, David Gamarnik, Eren C. Kızıldağ, Ilias Zadik

Abstract: In the context of neural network models, overparametrization refers to the phenomenon whereby these models appear to generalize well on the unseen data, even though the number of parameters significantly exceeds the sample sizes, and the model perfectly fits the in-training data. A conventional explanation of this phenomenon is based on self-regularization properties of algorithms used to train the data. In this paper we prove a series of results which provide a somewhat diverging explanation. Adopting a teacher/student model where the teacher network is used to generate the predictions and student network is trained on the observed labeled data, and then tested on out-of-sample data, we show that any student network interpolating the data generated by a teacher network generalizes well, provided that the sample size is at least an explicit quantity controlled by data dimension and approximation guarantee alone, regardless of the number of internal nodes of either teacher or student network. Our claim is based on approximating both teacher and student networks by polynomial (tensor) regression models with degree depending on the desired accuracy and network depth only. Such a parametrization notably does not depend on the number of internal nodes. Thus a message implied by our results is that parametrizing wide neural networks by the number of hidden nodes is misleading, and a more fitting measure of parametrization complexity is the number of regression coefficients associated with tensorized data. In particular, this somewhat reconciles the generalization ability of neural networks with more classical statistical notions of data complexity and generalization bounds. Our empirical results on MNIST and Fashion-MNIST datasets indeed confirm that tensorized regression achieves a good out-of-sample performance, even when the degree of the tensor is at most two.

Submitted 23 March, 2020; originally announced March 2020.

Comments: 59 pages, 3 figures

arXiv:1912.01599 [pdf, ps, other]

Stationary Points of Shallow Neural Networks with Quadratic Activation Function

Authors: David Gamarnik, Eren C. Kızıldağ, Ilias Zadik

Abstract: We consider the teacher-student setting of learning shallow neural networks with quadratic activations and planted weight matrix $W^*\in\mathbb{R}^{m\times d}$, where $m$ is the width of the hidden layer and $d\le m$ is the data dimension. We study the optimization landscape associated with the empirical and the population squared risk of the problem. Under the assumption that the planted weights are full-rank we obtain the following results. First, we establish that the landscape of the empirical risk admits an "energy barrier" separating rank-deficient $W$ from $W^*$: if $W$ is rank deficient, then its risk is bounded away from zero by an amount we quantify. We then couple this result by showing that, assuming the number $N$ of samples grows at least like a polynomial function of $d$, all full-rank approximate stationary points of the empirical risk are nearly globally optimal. These two results allow us to prove that gradient descent, when initialized below the energy barrier, approximately minimizes the empirical risk and recovers the planted weights in polynomial time. Next, we show that initializing below this barrier is in fact easily achieved when the weights are randomly generated under relatively weak assumptions. We show that provided the network is sufficiently overparametrized, initializing with an appropriate multiple of the identity suffices to obtain a risk below the energy barrier. At a technical level, the last result is a consequence of the semicircle law for the Wishart ensemble and could be of independent interest. Finally, we study the minimizers of the empirical risk and identify a simple necessary and sufficient geometric condition on the training data under which any minimizer has necessarily zero generalization error. We show that as soon as $N\ge N^*=d(d+1)/2$, randomly generated data enjoys this geometric condition almost surely, while that ceases to be true if $N<N^*$.

Submitted 9 July, 2020; v1 submitted 3 December, 2019; originally announced December 2019.

Comments: 54 pages

arXiv:1910.10890 [pdf, other]

doi 10.1109/TIT.2021.3113921

Inference in High-Dimensional Linear Regression via Lattice Basis Reduction and Integer Relation Detection

Authors: David Gamarnik, Eren C. Kızıldağ, Ilias Zadik

Abstract: We focus on the high-dimensional linear regression problem, where the algorithmic goal is to efficiently infer an unknown feature vector $β^*\in\mathbb{R}^p$ from its linear measurements, using a small number $n$ of samples. Unlike most of the literature, we make no sparsity assumption on $β^*$, but instead adopt a different regularization: In the noiseless setting, we assume $β^*$ consists of entries, which are either rational numbers with a common denominator $Q\in\mathbb{Z}^+$ (referred to as $Q$-rationality); or irrational numbers supported on a rationally independent set of bounded cardinality, known to the learner; collectively called the mixed-support assumption. Using a novel combination of the PSLQ integer relation detection, and LLL lattice basis reduction algorithms, we propose a polynomial-time algorithm which provably recovers a $β^*\in\mathbb{R}^p$ enjoying the mixed-support assumption, from its linear measurements $Y=Xβ^*\in\mathbb{R}^n$ for a large class of distributions for the random entries of $X$, even with one measurement $(n=1)$. In the noisy setting, we propose a polynomial-time, lattice-based algorithm, which recovers a $β^*\in\mathbb{R}^p$ enjoying $Q$-rationality, from its noisy measurements $Y=Xβ^*+W\in\mathbb{R}^n$, even with a single sample $(n=1)$. We further establish that, for large $Q$ and normal noise, this algorithm tolerates an information-theoretically optimal level of noise. We then apply these ideas to develop a polynomial-time, single-sample algorithm for the phase retrieval problem. Our methods address the single-sample $(n=1)$ regime, where the sparsity-based methods such as LASSO and Basis Pursuit are known to fail. Furthermore, our results also reveal an algorithmic connection between the high-dimensional linear regression problem, and the integer relation detection, randomized subset-sum, and shortest vector problems.

Submitted 23 October, 2019; originally announced October 2019.

Comments: 56 pages. Parts of the material of this manuscript were presented at NeurIPS 2018, and ISIT 2019. This submission subsumes the content of arXiv:1803.06716

Journal ref: IEEE Transactions on Information Theory (Volume: 67, Issue: 12, December 2021)

arXiv:1805.11238 [pdf, ps, other]

Explicit construction of RIP matrices is Ramsey-hard

Authors: David Gamarnik

Abstract: Matrices $Φ\in\mathbb{R}^{n\times p}$ satisfying the Restricted Isometry Property (RIP) are an important ingredient of the compressive sensing methods. While it is known that random matrices satisfy the RIP with high probability even for $n=\log^{O(1)}p$, the explicit construction of such matrices has defied repeated efforts, and most known approaches hit the so-called $\sqrt{n}$ sparsity bottleneck. The notable exception is the work by Bourgain et al \cite{bourgain2011explicit} constructing an $n\times p$ RIP matrix with sparsity $s=Θ(n^{{1\over 2}+ε})$, but in the regime $n=Ω(p^{1-δ})$. In this short note we resolve this open question in a sense by showing that an explicit construction of a matrix satisfying the RIP in the regime $n=O(\log^2 p)$ and $s=Θ(n^{1\over 2})$ implies an explicit construction of a three-colored Ramsey graph on $p$ nodes with clique sizes bounded by $O(\log^2 p)$ -- a question in extremal combinatorics which has been open for decades.

Submitted 15 November, 2018; v1 submitted 29 May, 2018; originally announced May 2018.

Comments: 4 pages

arXiv:1803.06716 [pdf, other]

High Dimensional Linear Regression using Lattice Basis Reduction

Authors: David Gamarnik, Ilias Zadik

Abstract: We consider a high dimensional linear regression problem where the goal is to efficiently recover an unknown vector $β^*$ from $n$ noisy linear observations $Y=Xβ^*+W \in \mathbb{R}^n$, for known $X \in \mathbb{R}^{n \times p}$ and unknown $W \in \mathbb{R}^n$. Unlike most of the literature on this model we make no sparsity assumption on $β^*$. Instead we adopt a regularization based on assuming that the underlying vectors $β^*$ have rational entries with the same denominator $Q \in \mathbb{Z}_{>0}$. We call this the $Q$-rationality assumption. We propose a new polynomial-time algorithm for this task which is based on the seminal Lenstra-Lenstra-Lovász (LLL) lattice basis reduction algorithm. We establish that under the $Q$-rationality assumption, our algorithm recovers exactly the vector $β^*$ for a large class of distributions for the iid entries of $X$ and non-zero noise $W$. We prove that it is successful under small noise, even when the learner has access to only one observation ($n=1$). Furthermore, we prove that in the case of the Gaussian white noise for $W$, $n=o\left(p/\log p\right)$ and $Q$ sufficiently large, our algorithm tolerates a nearly optimal information-theoretic level of the noise.

Submitted 8 November, 2018; v1 submitted 18 March, 2018; originally announced March 2018.

arXiv:1711.04952 [pdf, ps, other]

Sparse High-Dimensional Linear Regression. Algorithmic Barriers and a Local Search Algorithm

Authors: David Gamarnik, Ilias Zadik

Abstract: We consider a sparse high dimensional regression model where the goal is to recover a $k$-sparse unknown vector $β^*$ from $n$ noisy linear observations of the form $Y=Xβ^*+W \in \mathbb{R}^n$ where $X \in \mathbb{R}^{n \times p}$ has iid $N(0,1)$ entries and $W \in \mathbb{R}^n$ has iid $N(0,σ^2)$ entries. Under certain assumptions on the parameters, an intriguing asymptotic gap appears between the minimum value of $n$, call it $n^*$, for which the recovery is information theoretically possible, and the minimum value of $n$, call it $n_{\mathrm{alg}}$, for which an efficient algorithm is known to provably recover $β^*$. In \cite{gamarnikzadik} it was conjectured that the gap is not artificial, in the sense that for sample sizes $n \in [n^*,n_{\mathrm{alg}}]$ the problem is algorithmically hard. We support this conjecture in two ways. Firstly, we show that the optimal solution of the LASSO provably fails to $\ell_2$-stably recover the unknown vector $β^*$ when $n \in [n^*,c n_{\mathrm{alg}}]$, for some sufficiently small constant $c>0$. Secondly, we establish that $n_{\mathrm{alg}}$, up to a multiplicative constant factor, is a phase transition point for the appearance of a certain Overlap Gap Property (OGP) over the space of $k$-sparse vectors. The presence of such an Overlap Gap Property phase transition, which originates in statistical physics, is known to provide evidence of algorithmic hardness. Finally we show that if $n>C n_{\mathrm{alg}}$ for some large enough constant $C>0$, a very simple algorithm based on a local search improvement rule is able both to $\ell_2$-stably recover the unknown vector $β^*$ and to infer correctly its support, adding it to the list of provably successful algorithms for the high dimensional linear regression problem.

Submitted 22 September, 2019; v1 submitted 14 November, 2017; originally announced November 2017.

Comments: Added a result on the failure of the LASSO recovery mechanism in the conjectured algorithmically hard regime $n<c n_{\mathrm{alg}}$ and minor corrections

arXiv:1702.02267 [pdf, ps, other]

Matrix Completion from $O(n)$ Samples in Linear Time

Authors: David Gamarnik, Quan Li, Hongyi Zhang

Abstract: We consider the problem of reconstructing a rank-$k$ $n \times n$ matrix $M$ from a sampling of its entries. Under a certain incoherence assumption on $M$ and for the case when both the rank and the condition number of $M$ are bounded, it was shown in \cite{CandesRecht2009, CandesTao2010, keshavan2010, Recht2011, Jain2012, Hardt2014} that $M$ can be recovered exactly or approximately (depending on some trade-off between accuracy and computational complexity) using $O(n \, \text{poly}(\log n))$ samples in super-linear time $O(n^{a} \, \text{poly}(\log n))$ for some constant $a \geq 1$. In this paper, we propose a new matrix completion algorithm using a novel sampling scheme based on a union of independent sparse random regular bipartite graphs. We show that under the same conditions w.h.p. our algorithm recovers an $ε$-approximation of $M$ in terms of the Frobenius norm using $O(n \log^2(1/ε))$ samples and in linear time $O(n \log^2(1/ε))$. This provides the best known bounds both on the sample complexity and computational complexity for reconstructing (approximately) an unknown low-rank matrix. The novelty of our algorithm is two new steps of thresholding singular values and rescaling singular vectors in the application of the "vanilla" alternating minimization algorithm. The structure of sparse random regular graphs is used heavily for controlling the impact of these regularization steps.

Submitted 22 August, 2017; v1 submitted 7 February, 2017; originally announced February 2017.

Comments: 45 pages, 1 figure. Short version accepted for presentation at Conference on Learning Theory (COLT) 2017

arXiv:1701.04455 [pdf, other]

High-Dimensional Regression with Binary Coefficients. Estimating Squared Error and a Phase Transition

Authors: David Gamarnik, Ilias Zadik

Abstract: We consider a sparse linear regression model $Y=Xβ^{*}+W$ where $X$ has Gaussian entries, $W$ is the noise vector with mean zero Gaussian entries, and $β^{*}$ is a binary vector with support size (sparsity) $k$. Using a novel conditional second moment method we obtain a tight up to a multiplicative constant approximation of the optimal squared error $\min_β\|Y-Xβ\|_{2}$, where the minimization is over all $k$-sparse binary vectors $β$. The approximation reveals interesting structural properties of the underlying regression problem. In particular, a) We establish that $n^*=2k\log p/\log (2k/σ^{2}+1)$ is a phase transition point with the following "all-or-nothing" property. When $n$ exceeds $n^{*}$, $(2k)^{-1}\|β_{2}-β^*\|_0\approx 0$, and when $n$ is below $n^{*}$, $(2k)^{-1}\|β_{2}-β^*\|_0\approx 1$, where $β_2$ is the optimal solution achieving the smallest squared error. With this we prove that $n^{*}$ is the asymptotic threshold for recovering $β^*$ information theoretically. b) We compute the squared error for an intermediate problem $\min_β\|Y-Xβ\|_{2}$ where minimization is restricted to vectors $β$ with $\|β-β^{*}\|_0=2k ζ$, for $ζ\in [0,1]$. We show that a lower bound part $Γ(ζ)$ of the estimate, which corresponds to the estimate based on the first moment method, undergoes a phase transition at three different thresholds, namely $n_{\text{inf,1}}=σ^2\log p$, which is the information-theoretic bound for recovering $β^*$ when $k=1$ and $σ$ is large, then at $n^{*}$ and finally at $n_{\text{LASSO/CS}}$. c) We establish a certain Overlap Gap Property (OGP) on the space of all binary vectors $β$ when $n\le ck\log p$ for sufficiently small constant $c$. We conjecture that OGP is the source of algorithmic hardness of solving the minimization problem $\min_β\|Y-Xβ\|_{2}$ in the regime $n<n_{\text{LASSO/CS}}$.

Submitted 25 September, 2019; v1 submitted 16 January, 2017; originally announced January 2017.

Comments: 36 pages, 5 figures

arXiv:1603.06002 [pdf, ps, other]

A Message Passing Algorithm for the Problem of Path Packing in Graphs

Authors: Patrick Eschenfeldt, David Gamarnik

Abstract: We consider the problem of packing node-disjoint directed paths in a directed graph. We consider a variant of this problem where each path starts within a fixed subset of root nodes, subject to a given bound on the length of paths. This problem is motivated by the so-called kidney exchange problem, but has other potential applications and is interesting in its own right. We propose a new algorithm for this problem based on the message passing/belief propagation technique. A priori this problem does not have an associated graphical model, so in order to apply a belief propagation algorithm we provide a novel representation of the problem as a graphical model. Standard belief propagation on this model has poor scaling behavior, so we provide an efficient implementation that significantly decreases the complexity. We provide numerical results comparing the performance of our algorithm on both artificially created graphs and real world networks to several alternative algorithms, including algorithms based on integer programming (IP) techniques. These comparisons show that our algorithm scales better to large instances than IP-based algorithms and often finds better solutions than a simple algorithm that greedily selects the longest path from each root node. In some cases it also finds better solutions than the ones found by IP-based algorithms even when the latter are allowed to run significantly longer than our algorithm.

Submitted 18 March, 2016; originally announced March 2016.

Comments: 34 pages

arXiv:1602.02164 [pdf, other]

doi 10.1109/LSP.2016.2576979

A Note on Alternating Minimization Algorithm for the Matrix Completion Problem

Authors: David Gamarnik, Sidhant Misra

Abstract: We consider the problem of reconstructing a low rank matrix from a subset of its entries and analyze two variants of the so-called Alternating Minimization algorithm, which has been proposed in the past. We establish that when the underlying matrix has rank $r=1$, has positive bounded entries, and the graph $\mathcal{G}$ underlying the revealed entries has bounded degree and diameter which is at most logarithmic in the size of the matrix, both algorithms succeed in reconstructing the matrix approximately in polynomial time starting from an arbitrary initialization. We further provide simulation results which suggest that the second algorithm, which is based on message passing type updates, performs significantly better.

Submitted 5 February, 2016; originally announced February 2016.

Comments: 8 pages, 2 figures

arXiv:1412.1443 [pdf, ps, other]

Structure learning of antiferromagnetic Ising models

Authors: Guy Bresler, David Gamarnik, Devavrat Shah

Abstract: In this paper we investigate the computational complexity of learning the graph structure underlying a discrete undirected graphical model from i.i.d. samples. We first observe that the notoriously difficult problem of learning parities with noise can be captured as a special case of learning graphical models. This leads to an unconditional computational lower bound of $Ω(p^{d/2})$ for learning general graphical models on $p$ nodes of maximum degree $d$, for the class of so-called statistical algorithms recently introduced by Feldman et al (2013). The lower bound suggests that the $O(p^d)$ runtime required to exhaustively search over neighborhoods cannot be significantly improved without restricting the class of models. Aside from structural assumptions on the graph such as it being a tree, hypertree, tree-like, etc., many recent papers on structure learning assume that the model has the correlation decay property. Indeed, focusing on ferromagnetic Ising models, Bento and Montanari (2009) showed that all known low-complexity algorithms fail to learn simple graphs when the interaction strength exceeds a number related to the correlation decay threshold. Our second set of results gives a class of repelling (antiferromagnetic) models that have the opposite behavior: very strong interaction allows efficient learning in time $O(p^2)$. We provide an algorithm whose performance interpolates between $O(p^2)$ and $O(p^{d+2})$ depending on the strength of the repulsion.

Submitted 3 December, 2014; originally announced December 2014.

Comments: 15 pages. NIPS 2014

arXiv:1410.7659 [pdf, ps, other]

Learning graphical models from the Glauber dynamics

Authors: Guy Bresler, David Gamarnik, Devavrat Shah

Abstract: In this paper we consider the problem of learning undirected graphical models from data generated according to the Glauber dynamics. The Glauber dynamics is a Markov chain that sequentially updates individual nodes (variables) in a graphical model and it is frequently used to sample from the stationary distribution (to which it converges given sufficient time). Additionally, the Glauber dynamics is a natural dynamical model in a variety of settings. This work deviates from the standard formulation of graphical model learning in the literature, where one assumes access to i.i.d. samples from the distribution. Much of the research on graphical model learning has been directed towards finding algorithms with low computational cost. As the main result of this work, we establish that the problem of reconstructing binary pairwise graphical models is computationally tractable when we observe the Glauber dynamics. Specifically, we show that a binary pairwise graphical model on $p$ nodes with maximum degree $d$ can be learned in time $f(d)p^2\log p$, for a function $f(d)$, using nearly the information-theoretic minimum number of samples.

Submitted 28 November, 2014; v1 submitted 28 October, 2014; originally announced October 2014.

Comments: 9 pages. Appeared in Allerton Conference 2014

arXiv:1409.3836 [pdf, ps, other]

Hardness of parameter estimation in graphical models

Authors: Guy Bresler, David Gamarnik, Devavrat Shah

Abstract: We consider the problem of learning the canonical parameters specifying an undirected graphical model (Markov random field) from the mean parameters. For graphical models representing a minimal exponential family, the canonical parameters are uniquely determined by the mean parameters, so the problem is feasible in principle. The goal of this paper is to investigate the computational feasibility of this statistical task. Our main result shows that parameter estimation is in general intractable: no algorithm can learn the canonical parameters of a generic pair-wise binary graphical model from the mean parameters in time bounded by a polynomial in the number of variables (unless RP = NP). Indeed, such a result has been believed to be true (see the monograph by Wainwright and Jordan (2008)) but no proof was known. Our proof gives a polynomial time reduction from approximating the partition function of the hard-core model, known to be hard, to learning approximate parameters. Our reduction entails showing that the marginal polytope boundary has an inherent repulsive property, which validates an optimization procedure over the polytope that does not use any knowledge of its structure (as required by the ellipsoid method and others).

Submitted 17 September, 2014; v1 submitted 12 September, 2014; originally announced September 2014.

Comments: 15 pages. To appear in NIPS 2014

Showing 1–15 of 15 results for author: Gamarnik, D