-
Convergence Rates for Stochastic Approximation: Biased Noise with Unbounded Variance, and Applications
Authors:
Rajeeva L. Karandikar,
M. Vidyasagar
Abstract:
In this paper, we study the convergence properties of the Stochastic Gradient Descent (SGD) method for finding a stationary point of a given objective function $J(\cdot)$. The objective function is not required to be convex. Rather, our results apply to a class of ``invex'' functions, which have the property that every stationary point is also a global minimizer. First, it is assumed that $J(\cdot)$ satisfies a property that is slightly weaker than the Kurdyka-Lojasiewicz (KL) condition, denoted here as (KL'). It is shown that the iterations $J({\boldsymbol θ}_t)$ converge almost surely to the global minimum of $J(\cdot)$. Next, the hypothesis on $J(\cdot)$ is strengthened from (KL') to the Polyak-Lojasiewicz (PL) condition. With this stronger hypothesis, we derive estimates on the rate of convergence of $J({\boldsymbol θ}_t)$ to its limit. Using these results, we show that for functions satisfying the PL property, the convergence rate of SGD is the same as the best-possible rate for convex functions. While some results along these lines have been published in the past, our contributions contain two distinct improvements. First, the assumptions on the stochastic gradient are more general than elsewhere, and second, our convergence is almost sure, and not in expectation. We also study SGD when only function evaluations are permitted. In this setting, we determine the ``optimal'' increments or the size of the perturbations. Using the same set of ideas, we establish the global convergence of the Stochastic Approximation (SA) algorithm under more general assumptions on the measurement error, compared to the existing literature. We also derive bounds on the rate of convergence of the SA algorithm under appropriate assumptions.
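For orientation, the PL condition referred to above is commonly written as follows. This is the textbook form, stated here as an assumption; the constant $\mu > 0$ and the symbol $J^*$ (the infimum of $J$) are notation introduced for this note, and the paper's own (PL) and (KL') hypotheses may be phrased differently.

```latex
% Commonly used form of the Polyak-Lojasiewicz (PL) condition (assumed notation):
\[
  \tfrac{1}{2}\,\|\nabla J({\boldsymbol \theta})\|_2^2
  \;\ge\; \mu \,\bigl( J({\boldsymbol \theta}) - J^* \bigr)
  \qquad \text{for all } {\boldsymbol \theta}, \quad \mu > 0 .
\]
% Consequence: if \nabla J(\theta) = 0, the left side vanishes, so J(\theta) <= J^*,
% i.e. every stationary point is a global minimizer (the "invex" property).
```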
Submitted 12 May, 2024; v1 submitted 5 December, 2023;
originally announced December 2023.
-
A Tutorial Introduction to Reinforcement Learning
Authors:
Mathukumalli Vidyasagar
Abstract:
In this paper, we present a brief survey of Reinforcement Learning (RL), with particular emphasis on Stochastic Approximation (SA) as a unifying theme. The scope of the paper includes Markov Reward Processes, Markov Decision Processes, Stochastic Approximation algorithms, and widely used algorithms such as Temporal Difference Learning and $Q$-learning.
Submitted 3 April, 2023;
originally announced April 2023.
-
Convergence of Momentum-Based Heavy Ball Method with Batch Updating and/or Approximate Gradients
Authors:
Tadipatri Uday Kiran Reddy,
Mathukumalli Vidyasagar
Abstract:
In this paper, we study the well-known "Heavy Ball" method for convex and nonconvex optimization introduced by Polyak in 1964, and establish its convergence under a variety of situations. Traditionally, most algorithms use "full-coordinate update," that is, at each step, every component of the argument is updated. However, when the dimension of the argument is very high, it is more efficient to update some but not all components of the argument at each iteration. We refer to this as "batch updating" in this paper. When gradient-based algorithms are used together with batch updating, in principle it is sufficient to compute only those components of the gradient for which the argument is to be updated. However, if a method such as backpropagation is used to compute these components, computing only some components of the gradient does not offer much savings over computing the entire gradient. Therefore, to achieve a noticeable reduction in CPU usage at each step, one can use first-order differences to approximate the gradient. The resulting estimates are biased, and also have unbounded variance. Thus some delicate analysis is required to ensure that the HB algorithm converges when batch updating is used instead of full-coordinate updating, and/or approximate gradients are used instead of true gradients. In this paper, we establish the almost sure convergence of the iterations to the stationary point(s) of the objective function under suitable conditions; in addition, we also derive upper bounds on the rate of convergence. To the best of our knowledge, there is no other paper that combines all of these features. This paper is dedicated to the memory of Boris Teodorovich Polyak.
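The idea of combining the Heavy Ball step with batch updating can be illustrated with a minimal sketch. It is not the algorithm analyzed in the paper: the function name `heavy_ball_batch`, the constant step size and momentum, the uniform random choice of coordinates, and the use of an exact (noise-free) gradient are all assumptions made for illustration.

```python
import random

def heavy_ball_batch(grad, theta0, step=0.1, momentum=0.5,
                     batch_size=1, iters=2000, seed=0):
    """Heavy Ball iteration in which only a random subset of coordinates
    is updated at each step (hypothetical illustration)."""
    rng = random.Random(seed)
    theta = list(theta0)   # current iterate
    prev = list(theta0)    # previous iterate, needed for the momentum term
    d = len(theta)
    for _ in range(iters):
        g = grad(theta)                       # full gradient, for simplicity
        idx = rng.sample(range(d), batch_size)  # coordinates updated this step
        new = list(theta)
        for i in idx:
            # gradient step plus momentum, applied to selected coordinates only
            new[i] = theta[i] - step * g[i] + momentum * (theta[i] - prev[i])
        prev, theta = theta, new
    return theta

# Example: J(theta) = 0.5 * (theta_0^2 + 2 * theta_1^2), minimized at the origin.
theta = heavy_ball_batch(lambda th: [th[0], 2.0 * th[1]], [1.0, -1.0])
```

In this toy quadratic the iterates approach the origin even though only one coordinate moves per step; the paper's setting additionally allows approximate gradients, which this sketch does not model.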
Submitted 10 June, 2023; v1 submitted 28 March, 2023;
originally announced March 2023.
-
Estimating large causal polytrees from small samples
Authors:
Sourav Chatterjee,
Mathukumalli Vidyasagar
Abstract:
We consider the problem of estimating a large causal polytree from a relatively small i.i.d. sample. This is motivated by the problem of determining causal structure when the number of variables is very large compared to the sample size, such as in gene regulatory networks. We give an algorithm that recovers the tree with high accuracy in such settings. The algorithm works under essentially no distributional or modeling assumptions other than some mild non-degeneracy conditions.
Submitted 29 March, 2024; v1 submitted 14 September, 2022;
originally announced September 2022.
-
Convergence of Batch Updating Methods with Approximate Gradients and/or Noisy Measurements: Theory and Computational Results
Authors:
Tadipatri Uday Kiran Reddy,
M. Vidyasagar
Abstract:
In this paper, we present a unified and general framework for analyzing the batch updating approach to nonlinear, high-dimensional optimization. The framework encompasses all the currently used batch updating approaches, and is applicable to nonconvex as well as convex functions. Moreover, the framework permits the use of noise-corrupted gradients, as well as first-order approximations to the gradient (sometimes referred to as "gradient-free" approaches). By viewing the analysis of the iterations as a problem in the convergence of stochastic processes, we are able to establish a very general theorem, which includes most known convergence results for zeroth-order and first-order methods. The analysis of "second-order" or momentum-based methods is not a part of this paper, and will be studied elsewhere. However, numerical experiments indicate that momentum-based methods can fail if the true gradient is replaced by its first-order approximation. This requires further theoretical analysis.
Submitted 27 January, 2023; v1 submitted 12 September, 2022;
originally announced September 2022.
-
Convergence of Stochastic Approximation via Martingale and Converse Lyapunov Methods
Authors:
M. Vidyasagar
Abstract:
In this paper, we study the almost sure boundedness and the convergence of the stochastic approximation (SA) algorithm. At present, most available convergence proofs are based on the ODE method, and the almost sure boundedness of the iterations is an assumption and not a conclusion. In Borkar-Meyn (2000), it is shown that if the ODE has only one globally attractive equilibrium, then under additional assumptions, the iterations are bounded almost surely, and the SA algorithm converges to the desired solution. Our objective in the present paper is to provide an alternate proof of the above, based on martingale methods, which are simpler and less technical than those based on the ODE method. As a prelude, we prove a new sufficient condition for the global asymptotic stability of an ODE. Next we prove a "converse" Lyapunov theorem on the existence of a suitable Lyapunov function with a globally bounded Hessian, for a globally exponentially stable system. Both theorems are of independent interest to researchers in stability theory. Then, using these results, we provide sufficient conditions for the almost sure boundedness and the convergence of the SA algorithm. We show through examples that our theory covers some situations that are not covered by currently known results, specifically Borkar-Meyn (2000).
Submitted 9 January, 2023; v1 submitted 3 May, 2022;
originally announced May 2022.
-
Convergence of Batch Asynchronous Stochastic Approximation With Applications to Reinforcement Learning
Authors:
Rajeeva L. Karandikar,
M. Vidyasagar
Abstract:
Ever since its introduction in the classic paper of Robbins and Monro in 1951, Stochastic Approximation (SA) has become a standard tool for finding a solution of an equation of the form $f(θ) = 0$, when only noisy measurements of $f(\cdot)$ are available. In most situations, \textit{every component} of the putative solution $θ_t$ is updated at each step $t$. In some applications such as $Q$-learning, a key technique in Reinforcement Learning (RL), \textit{only one component} of $θ_t$ is updated at each $t$. This is known as \textbf{asynchronous} SA. The present paper studies \textbf{Block Asynchronous SA (BASA)}, in which, at each step $t$, \textit{some but not necessarily all} components of $θ_t$ are updated. The theory presented here embraces conventional (synchronous) SA, asynchronous SA, and all in-between possibilities. We also prove bounds on the \textit{rate} of convergence of $θ_t$ to the solutions.
As a prelude to the new results, we also briefly survey some results on the convergence of the Stochastic Gradient method, proved in a companion paper by the present authors.
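A minimal sketch of the block-asynchronous idea described above follows. It is an illustration under stated assumptions, not the paper's algorithm: the name `block_async_sa`, the Bernoulli rule for choosing which components to update, the Gaussian measurement noise, and the per-component step size `1/count` (a "local clock") are all hypothetical choices.

```python
import random

def block_async_sa(f, theta0, iters=20000, noise_std=0.1,
                   update_prob=0.5, seed=0):
    """Stochastic approximation for f(theta) = 0 in which each component
    is updated only at some steps (hypothetical illustration)."""
    rng = random.Random(seed)
    theta = list(theta0)
    counts = [0] * len(theta)   # number of updates each component has received
    for _ in range(iters):
        # noisy measurement of f at the current iterate
        y = [fi + rng.gauss(0.0, noise_std) for fi in f(theta)]
        for i in range(len(theta)):
            if rng.random() < update_prob:   # component i is updated this step
                counts[i] += 1
                alpha = 1.0 / counts[i]      # step size driven by a local clock
                theta[i] += alpha * y[i]
    return theta

# Example: f(theta) = c - theta with c = (1, -2); the root is theta = c.
theta = block_async_sa(lambda th: [1.0 - th[0], -2.0 - th[1]], [0.0, 0.0])
```

Setting `update_prob=1.0` recovers the conventional synchronous iteration; smaller values move toward the asynchronous regime in which few components change per step.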
Submitted 20 February, 2024; v1 submitted 8 September, 2021;
originally announced September 2021.
-
SUTRA: A Novel Approach to Modelling Pandemics with Applications to COVID-19
Authors:
Manindra Agrawal,
Madhuri Kanitkar,
Deepu Phillip,
Tanima Hajra,
Arti Singh,
Avaneesh Singh,
Prabal Pratap Singh,
Mathukumalli Vidyasagar
Abstract:
The Covid-19 pandemic has two key properties: (i) asymptomatic cases (both detected and undetected) that can result in new infections, and (ii) time-varying characteristics due to new variants, Non-Pharmaceutical Interventions etc. We develop a model called SUTRA (Susceptible, Undetected though infected, Tested positive, and Removed Analysis) that takes into account both of these key properties. While applying the model to a region, two parameters of the model can be learnt from the number of daily new cases found in the region. Using the learnt values of the parameters, the model can predict the number of daily new cases so long as the learnt parameters do not change substantially. Whenever either of the two parameters changes due to key property (ii) above, the SUTRA model can detect that the values of one or both of the parameters have changed. Further, the model has the capability to relearn the changed parameter values, and then use these to carry out the prediction of the trajectory of the pandemic for the region of concern. The SUTRA approach can be applied at various levels of granularity, from an entire country to a district, more specifically, to any large enough region for which the data of daily new cases are available.
We have applied the SUTRA model to thirty-two countries, covering more than half of the world's population. Our conclusions are: (i) The model is able to capture the past trajectories very well. Moreover, the parameter values, which we can estimate robustly, help quantify the impact of changes in the pandemic characteristics. (ii) Unless the pandemic characteristics change significantly, the model has good predictive capability. (iii) Natural immunity provides significantly better protection against infection than the currently available vaccines.
Submitted 25 October, 2022; v1 submitted 22 January, 2021;
originally announced January 2021.
-
Estimating Hidden Asymptomatics, Herd Immunity Threshold and Lockdown Effects using a COVID-19 Specific Model
Authors:
Shaurya Kaushal,
Abhineet Singh Rajput,
Soumyadeep Bhattacharya,
M. Vidyasagar,
Aloke Kumar,
Meher K. Prakash,
Santosh Ansumali
Abstract:
A quantitative COVID-19 model that incorporates hidden asymptomatic patients is developed, and an analytic solution in parametric form is given. The model incorporates the impact of lockdown and the resulting spatial migration of population due to the announcement of lockdown. A method is presented for estimating the model parameters from real-world data. It is shown that the increase in infections slows down and herd immunity is achieved when symptomatic patients are 4-6\% of the population for the European countries we studied, when the total infected fraction is between 50\% and 56\%. Finally, a method for estimating the number of asymptomatic patients, who have been the key hidden link in the spread of the infections, is presented.
Submitted 29 May, 2020;
originally announced June 2020.
-
New and Explicit Constructions of Unbalanced Ramanujan Bipartite Graphs
Authors:
Shantanu Prasad Burnwal,
Kaneenika Sinha,
Mathukumalli Vidyasagar
Abstract:
The objectives of this article are three-fold. Firstly, we present for the first time explicit constructions of an infinite family of \textit{unbalanced} Ramanujan bigraphs. Secondly, we revisit some of the known methods for constructing Ramanujan graphs and discuss the computational work required in actually implementing the various construction methods. The third goal of this article is to address the following question: can we construct a bipartite Ramanujan graph with specified degrees, but with the restriction that the edge set of this graph must be distinct from a given set of "prohibited" edges? We provide an affirmative answer in many cases, as long as the set of prohibited edges is not too large.
Submitted 12 November, 2020; v1 submitted 8 October, 2019;
originally announced October 2019.
-
Deterministic Completion of Rectangular Matrices Using Asymmetric Ramanujan Graphs: Exact and Stable Recovery
Authors:
Shantanu Prasad Burnwal,
Mathukumalli Vidyasagar
Abstract:
In this paper we study the matrix completion problem: Suppose $X \in {\mathbb R}^{n_r \times n_c}$ is unknown except for a known upper bound $r$ on its rank. By measuring a small number $m \ll n_r n_c$ of elements of $X$, is it possible to recover $X$ exactly with noise-free measurements, or to construct a good approximation of $X$ with noisy measurements? Existing solutions to these problems involve sampling the elements uniformly and at random, and can guarantee exact recovery of the unknown matrix only with high probability. In this paper, we present a \textit{deterministic} sampling method for matrix completion. We achieve this by choosing the sampling set as the edge set of an asymmetric Ramanujan bigraph, and constrained nuclear norm minimization is the recovery method. Specifically, we derive sufficient conditions under which the unknown matrix is completed exactly with noise-free measurements, and is approximately completed with noisy measurements, which we call "stable" completion.
The conditions derived here are only sufficient and more restrictive than random sampling. To study how close they are to being necessary, we conducted numerical simulations on randomly generated low rank matrices, using the LPS families of Ramanujan graphs. These simulations demonstrate two facts: (i) In order to achieve exact completion, it appears sufficient to choose the degree $d$ of the Ramanujan graph to be $\geq 3r$. (ii) There is a "phase transition," whereby the likelihood of success suddenly drops from 100\% to 0\% if the rank is increased by just one or two beyond a critical value. The phase transition phenomenon is well-known and well-studied in vector recovery using $\ell_1$-norm minimization. However, it is less studied in matrix completion and nuclear norm minimization, and not much understood.
Submitted 21 May, 2020; v1 submitted 2 August, 2019;
originally announced August 2019.
-
Compressed Sensing Using Binary Matrices of Nearly Optimal Dimensions
Authors:
Mahsa Lotfi,
Mathukumalli Vidyasagar
Abstract:
In this paper, we study the problem of compressed sensing using binary measurement matrices and $\ell_1$-norm minimization (basis pursuit) as the recovery algorithm. We derive new upper and lower bounds on the number of measurements to achieve robust sparse recovery with binary matrices. We establish sufficient conditions for a column-regular binary matrix to satisfy the robust null space property (RNSP) and show that the associated sufficient conditions for robust sparse recovery obtained using the RNSP are better by a factor of $(3 \sqrt{3})/2 \approx 2.6$ compared to the sufficient conditions obtained using the restricted isometry property (RIP). Next we derive universal \textit{lower} bounds on the number of measurements that any binary matrix needs to have in order to satisfy the weaker sufficient condition based on the RNSP and show that bipartite graphs of girth six are optimal. Then we display two classes of binary matrices, namely parity check matrices of array codes and Euler squares, which have girth six and are nearly optimal in the sense of almost satisfying the lower bound. In principle, randomly generated Gaussian measurement matrices are "order-optimal". So we compare the phase transition behavior of the basis pursuit formulation using binary array codes and Gaussian matrices and show that (i) there is essentially no difference between the phase transition boundaries in the two cases and (ii) basis pursuit with binary matrices is hundreds of times faster in CPU time than with Gaussian matrices, and the storage requirements are lower. Therefore it is suggested that binary matrices are a viable alternative to Gaussian matrices for compressed sensing using basis pursuit.
Submitted 26 April, 2020; v1 submitted 8 August, 2018;
originally announced August 2018.
-
An Approach to One-Bit Compressed Sensing Based on Probably Approximately Correct Learning Theory
Authors:
Mehmet Eren Ahsen,
Mathukumalli Vidyasagar
Abstract:
In this paper, the problem of one-bit compressed sensing (OBCS) is formulated as a problem in probably approximately correct (PAC) learning. It is shown that the Vapnik-Chervonenkis (VC-) dimension of the set of half-spaces in $\mathbb{R}^n$ generated by $k$-sparse vectors is bounded below by $k \lg (n/k)$ and above by $2k \lg (n/k)$, plus some round-off terms. By coupling this estimate with well-established results in PAC learning theory, we show that a consistent algorithm can recover a $k$-sparse vector with $O(k \lg (n/k))$ measurements, given only the signs of the measurement vector. This result holds for \textit{all} probability measures on $\mathbb{R}^n$. It is further shown that random sign-flipping errors result only in an increase in the constant in the $O(k \lg (n/k))$ estimate. Because constructing a consistent algorithm is not straightforward, we present a heuristic based on the $\ell_1$-norm support vector machine, and illustrate that its computational performance is superior to a currently popular method.
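To give a feel for the size of the two leading terms in the VC-dimension bounds quoted above, the following sketch evaluates $k \lg (n/k)$ and $2k \lg (n/k)$ for given $n$ and $k$. The helper name `vc_leading_terms` is hypothetical, and the round-off terms mentioned in the abstract, as well as the constant hidden in the $O(\cdot)$ measurement estimate, are deliberately ignored.

```python
import math

def vc_leading_terms(n, k):
    """Leading terms k*lg(n/k) and 2k*lg(n/k) of the lower and upper
    VC-dimension bounds (round-off terms omitted; illustrative only)."""
    base = k * math.log2(n / k)
    return base, 2.0 * base

lower, upper = vc_leading_terms(1024, 16)
print(lower, upper)  # prints 96.0 192.0
```

For $n = 1024$ and $k = 16$ the leading terms are far smaller than the ambient dimension, which is the qualitative point of the $O(k \lg (n/k))$ estimate.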
Submitted 22 October, 2017;
originally announced October 2017.
-
CLOT Norm Minimization for Continuous Hands-off Control
Authors:
Niharika Challapalli,
Masaaki Nagahara,
Mathukumalli Vidyasagar
Abstract:
In this paper, we consider hands-off control via minimization of the CLOT (Combined $L$-One and Two) norm. The maximum hands-off control is the $L^0$-optimal (or the sparsest) control among all feasible controls that are bounded by a specified value and transfer the state from a given initial state to the origin within a fixed time duration. In general, the maximum hands-off control is a bang-off-bang control taking values of $\pm 1$ and $0$. For many real applications, such discontinuity in the control is not desirable. To obtain a continuous but still relatively sparse control, we propose to use the CLOT norm, a convex combination of $L^1$ and $L^2$ norms. We show by numerical simulations that the CLOT control is continuous and much sparser (i.e. has longer time duration on which the control takes 0) than the conventional EN (elastic net) control, which is a convex combination of $L^1$ and squared $L^2$ norms. We also prove that the CLOT control is continuous in the sense that, if $h$ denotes the sampling period, then the difference between successive values of the CLOT-optimal control is $O(\sqrt{h})$, which is a form of continuity. Also, the CLOT formulation is extended to encompass constraints on the state variable.
Submitted 22 October, 2017;
originally announced October 2017.
-
A Fast Noniterative Algorithm for Compressive Sensing Using Binary Measurement Matrices
Authors:
Mahsa Lotfi,
Mathukumalli Vidyasagar
Abstract:
In this paper we present a new algorithm for compressive sensing that makes use of binary measurement matrices and achieves exact recovery of ultra sparse vectors, in a single pass and without any iterations. Due to its noniterative nature, our algorithm is hundreds of times faster than $\ell_1$-norm minimization, and methods based on expander graphs, both of which require multiple iterations. Our algorithm can accommodate nearly sparse vectors, in which case it recovers the index set of the largest components, and can also accommodate burst noise measurements. Compared to compressive sensing methods that are guaranteed to achieve exact recovery of all sparse vectors, our method requires fewer measurements. However, methods that achieve statistical recovery, that is, recovery of almost all but not all sparse vectors, can require fewer measurements than our method.
Submitted 21 May, 2018; v1 submitted 11 August, 2017;
originally announced August 2017.
-
Continuous Hands-off Control by CLOT Norm Minimization
Authors:
Niharika Challapalli,
Masaaki Nagahara,
Mathukumalli Vidyasagar
Abstract:
In this paper, we consider hands-off control via minimization of the CLOT (Combined L-One and Two) norm. The maximum hands-off control is the L0-optimal (or the sparsest) control among all feasible controls that are bounded by a specified value and transfer the state from a given initial state to the origin within a fixed time duration. In general, the maximum hands-off control is a bang-off-bang control taking values of +1, -1, and 0. For many real applications, such discontinuity in the control is not desirable. To obtain a continuous but still relatively sparse control, we propose to use the CLOT norm, a convex combination of L1 and L2 norms. We show by numerical simulation that the CLOT control is continuous and much sparser (i.e. has longer time duration on which the control takes 0) than the conventional EN (elastic net) control, which is a convex combination of L1 and squared L2 norms.
Submitted 7 November, 2016;
originally announced November 2016.
-
Tight Performance Bounds for Compressed Sensing With Conventional and Group Sparsity
Authors:
Shashank Ranjan,
Mathukumalli Vidyasagar
Abstract:
In this paper, we study the problem of recovering a group sparse vector from a small number of linear measurements. In the past the common approach has been to use various "group sparsity-inducing" norms such as the Group LASSO norm for this purpose. By using the theory of convex relaxations, we show that it is also possible to use $\ell_1$-norm minimization for group sparse recovery. We introduce a new concept called group robust null space property (GRNSP), and show that, under suitable conditions, a group version of the restricted isometry property (GRIP) implies the GRNSP, and thus leads to group sparse recovery. When all groups are of equal size, our bounds are less conservative than known bounds. Moreover, our results apply even to situations where the groups have different sizes. When specialized to conventional sparsity, our bounds reduce to one of the well-known "best possible" conditions for sparse recovery. This relationship between GRNSP and GRIP is new even for conventional sparsity, and substantially streamlines the proofs of some known results. Using this relationship, we derive bounds on the $\ell_p$-norm of the residual error vector for all $p \in [1,2]$, and not just when $p = 2$. When the measurement matrix consists of random samples of a sub-Gaussian random variable, we present bounds on the number of measurements, which are less conservative than currently known bounds.
Submitted 28 July, 2018; v1 submitted 19 June, 2016;
originally announced June 2016.
-
Error Bounds for Compressed Sensing Algorithms With Group Sparsity: A Unified Approach
Authors:
M. Eren Ahsen,
M. Vidyasagar
Abstract:
In compressed sensing, in order to recover a sparse or nearly sparse vector from possibly noisy measurements, the most popular approach is $\ell_1$-norm minimization. Upper bounds for the $\ell_2$-norm of the error between the true and estimated vectors are given in [1] and reviewed in [2], while bounds for the $\ell_1$-norm are given in [3]. When the unknown vector is not conventionally sparse but is "group sparse" instead, a variety of alternatives to the $\ell_1$-norm have been proposed in the literature, including the group LASSO, sparse group LASSO, and group LASSO with tree structured overlapping groups. However, no error bounds are available for any of these modified objective functions. In the present paper, a unified approach is presented for deriving upper bounds on the error between the true vector and its approximation, based on the notion of decomposable and $γ$-decomposable norms. The bounds presented cover all of the norms mentioned above, and also provide a guideline for choosing norms in future to accommodate alternate forms of sparsity.
Submitted 29 December, 2015;
originally announced December 2015.
-
Two New Approaches to Compressed Sensing Exhibiting Both Robust Sparse Recovery and the Grouping Effect
Authors:
Mehmet Eren Ahsen,
Niharika Challapalli,
Mathukumalli Vidyasagar
Abstract:
In this paper we introduce a new optimization formulation for sparse regression and compressed sensing, called CLOT (Combined L-One and Two), wherein the regularizer is a convex combination of the $\ell_1$- and $\ell_2$-norms. This formulation differs from the Elastic Net (EN) formulation, in which the regularizer is a convex combination of the $\ell_1$- and $\ell_2$-norm squared. It is shown that, in the context of compressed sensing, the EN formulation does not achieve robust recovery of sparse vectors, whereas the new CLOT formulation achieves robust recovery. Also, like EN but unlike LASSO, the CLOT formulation achieves the grouping effect, wherein coefficients of highly correlated columns of the measurement (or design) matrix are assigned roughly comparable values. It is already known that LASSO does not have the grouping effect. Therefore the CLOT formulation combines the best features of both LASSO (robust sparse recovery) and EN (grouping effect).
The CLOT formulation is a special case of another one called SGL (Sparse Group LASSO) which was introduced into the literature previously, but without any analysis of either the grouping effect or robust sparse recovery. It is shown here that SGL achieves robust sparse recovery, and also achieves a version of the grouping effect in that coefficients of highly correlated columns belonging to the same group of the measurement (or design) matrix are assigned roughly comparable values.
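To make the distinction between the two regularizers concrete, they can be written side by side as below. The weight $\mu \in [0,1]$ and the variable name $z$ are notation assumed here for illustration; the abstract does not fix symbols for them.

```latex
\begin{align*}
\text{CLOT:}\quad & (1-\mu)\,\Vert z \Vert_1 + \mu\,\Vert z \Vert_2, \\
\text{EN:}\quad   & (1-\mu)\,\Vert z \Vert_1 + \mu\,\Vert z \Vert_2^2 .
\end{align*}
```

The only difference is whether the $\ell_2$-norm enters squared.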
Submitted 20 June, 2017; v1 submitted 29 October, 2014;
originally announced October 2014.
-
Machine Learning Methods in the Computational Biology of Cancer
Authors:
Mathukumalli Vidyasagar
Abstract:
The objectives of this "perspective" paper are to review some recent advances in sparse feature selection for regression and classification, as well as compressed sensing, and to discuss how these might be used to develop tools to advance personalized cancer therapy. As an illustration of the possibilities, a new algorithm for sparse regression is presented, and is applied to predict the time to tumor recurrence in ovarian cancer. A new algorithm for sparse feature selection in classification problems is presented, and its validation in endometrial cancer is briefly discussed. Some open problems are also presented.
Submitted 24 February, 2014;
originally announced February 2014.
-
Near-Ideal Behavior of Compressed Sensing Algorithms
Authors:
Mehmet Eren Ahsen,
Mathukumalli Vidyasagar
Abstract:
In a recent paper, it is shown that the LASSO algorithm exhibits "near-ideal behavior," in the following sense: Suppose $y = Ax + η$ where $A$ satisfies the restricted isometry property (RIP) with a sufficiently small constant, and $\Vert η\Vert_2 \leq ε$. Then minimizing $\Vert z \Vert_1$ subject to $\Vert y - Az \Vert_2 \leq ε$ leads to an estimate $\hat{x}$ whose error $\Vert \hat{x} - x \Vert_2$ is bounded by a universal constant times the error achieved by an "oracle" that knows the location of the nonzero components of $x$. In the world of optimization, the LASSO algorithm has been generalized in several directions such as the group LASSO, the sparse group LASSO, either without or with tree-structured overlapping groups, and most recently, the sorted LASSO. In this paper, it is shown that {\it any algorithm\/} exhibits near-ideal behavior in the above sense, provided only that (i) the norm used to define the sparsity index is "decomposable," (ii) the penalty norm that is minimized in an effort to enforce sparsity is "$γ$-decomposable," and (iii) a "compressibility condition" in terms of a group restricted isometry property is satisfied. Specifically, the group LASSO, and the sparse group LASSO (with some permissible overlap in the groups), as well as the sorted $\ell_1$-norm minimization all exhibit near-ideal behavior. Explicit bounds on the residual error are derived that contain previously known results as special cases.
Submitted 20 April, 2014; v1 submitted 26 January, 2014;
originally announced January 2014.
-
An Elementary Derivation of the Large Deviation Rate Function for Finite State Markov Chains
Authors:
Mathukumalli Vidyasagar
Abstract:
Large deviation theory is a branch of probability theory that is devoted to a study of the "rate" at which empirical estimates of various quantities converge to their true values. The object of study in this paper is the rate at which estimates of the doublet frequencies of a Markov chain over a finite alphabet converge to their true values. In case the Markov process is actually an i.i.d.\ process, the rate function turns out to be the relative entropy (or Kullback-Leibler divergence) between the true and the estimated probability vectors. This result is a special case of a very general result known as Sanov's theorem and dates back to 1957. Moreover, since the introduction of the "method of types" by Csiszár and his co-workers during the 1980s, the proof of this version of Sanov's theorem has been "elementary," using some combinatorial arguments. However, when the i.i.d.\ process is replaced by a Markov process, the available proofs are far more complex. The main objective of this paper is therefore to present a first-principles derivation of the LDP for finite state Markov chains, using only simple combinatorial arguments (e.g.\ the method of types), thus gathering in one place various arguments and estimates that are scattered over the literature.
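For the i.i.d. case mentioned above, the rate function is a relative entropy between two probability vectors on the same finite alphabet. The following is a minimal sketch of that quantity, not code from the paper: the function name `relative_entropy`, the argument order, and the use of natural logarithms are all assumptions made for illustration.

```python
import math


def relative_entropy(p, q):
    """Return D(p || q) = sum_i p_i * log(p_i / q_i), in nats.

    p and q are sequences of probabilities over the same finite alphabet.
    Terms with p_i == 0 contribute zero; if p_i > 0 while q_i == 0 the
    divergence is infinite.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # convention: 0 * log(0 / q) = 0
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total
```

The divergence is zero exactly when the two vectors coincide, which matches the intuition that an estimate equal to the truth is not a "deviation" at all.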
Submitted 14 September, 2013;
originally announced September 2013.
-
Reverse Engineering Gene Interaction Networks Using the Phi-Mixing Coefficient
Authors:
Nitin Kumar Singh,
M. Eren Ahsen,
Shiva Mankala,
Hyun-Seok Kim,
Michael A. White,
M. Vidyasagar
Abstract:
Constructing gene interaction networks (GINs) from high-throughput gene expression data is an important and challenging problem in systems biology. Existing algorithms produce networks that either have undirected and unweighted edges, or else are constrained to contain no cycles, both of which are biologically unrealistic. In the present paper we propose a new algorithm, based on a concept from probability theory known as the phi-mixing coefficient, that produces networks whose edges are weighted and directed, and are permitted to contain cycles. Because there is no "ground truth" for genome-wide networks on a human scale, we analyzed the outcomes of several experiments on lung cancer, and matched the predictions from the inferred networks with experimental results. Specifically, we inferred three networks (NSCLC, Neuro-endocrine NSCLC plus SCLC, and normal) from the gene expression measurements of 157 lung cancer and 59 normal cell lines, compared with the outcomes of siRNA screening of 19,000+ genes on 11 NSCLC cell lines, and analyzed data from a ChIP-Seq experiment to determine putative downstream targets of the lineage specific oncogenic transcription factor ASCL1. The inferred networks displayed a scale-free or power law behavior between the degree of a node and the number of nodes with that degree. There was a strong correlation between the degree of a gene in the inferred NSCLC network and its essentiality for the survival of the cells. The inferred downstream neighborhood genes of ASCL1 in the SCLC network were significantly enriched by ChIP-Seq determined putative target genes, while no such enrichment was found in the inferred NSCLC network.
Submitted 12 March, 2016; v1 submitted 20 August, 2012;
originally announced August 2012.
-
Mixing Coefficients Between Discrete and Real Random Variables: Computation and Properties
Authors:
Mehmet Eren Ahsen,
Mathukumalli Vidyasagar
Abstract:
In this paper we study the problem of estimating the alpha-, beta- and phi-mixing coefficients between two random variables, that can either assume values in a finite set or the set of real numbers. In either case, explicit closed-form formulas for the beta-mixing coefficient are already known. Therefore for random variables assuming values in a finite set, our contributions are two-fold: (i) In the case of the alpha-mixing coefficient, we show that determining whether or not it exceeds a prespecified threshold is NP-complete, and provide efficiently computable upper and lower bounds. (ii) We derive an exact closed-form formula for the phi-mixing coefficient. Next, we prove analogs of the data-processing inequality from information theory for each of the three kinds of mixing coefficients. Then we move on to real-valued random variables, and show that by using percentile binning and allowing the number of bins to increase more slowly than the number of samples, we can generate empirical estimates that are consistent, i.e., converge to the true values as the number of samples approaches infinity.
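The percentile-binning idea for real-valued samples can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's estimator: it assumes the beta-mixing coefficient is taken as the total variation distance between the joint distribution and the product of the marginals, and the function names `percentile_bin` and `beta_mixing_estimate` are hypothetical. The rule for growing the number of bins with the sample size is left to the caller.

```python
from collections import Counter


def percentile_bin(samples, num_bins):
    """Assign each sample a bin index in 0..num_bins-1 by rank,
    so that every bin holds roughly the same number of samples."""
    n = len(samples)
    order = sorted(range(n), key=lambda i: samples[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * num_bins // n
    return bins


def beta_mixing_estimate(xs, ys, num_bins):
    """Empirical estimate of 0.5 * sum_ij |joint_ij - mx_i * my_j|
    after percentile-binning the paired samples xs and ys."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must be paired samples")
    n = len(xs)
    bx = percentile_bin(xs, num_bins)
    by = percentile_bin(ys, num_bins)
    joint = Counter(zip(bx, by))  # counts of each (x-bin, y-bin) pair
    mx = Counter(bx)              # marginal counts for x
    my = Counter(by)              # marginal counts for y
    total = 0.0
    for i in range(num_bins):
        for j in range(num_bins):
            total += abs(joint[(i, j)] / n - (mx[i] / n) * (my[j] / n))
    return 0.5 * total
```

With two bins, perfectly dependent samples (`ys` equal to `xs`) give an estimate of 0.5, while samples whose bins are balanced across every pair give 0.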
Submitted 3 July, 2013; v1 submitted 8 August, 2012;
originally announced August 2012.
-
A Metric Between Probability Distributions on Finite Sets of Different Cardinalities and Applications to Order Reduction
Authors:
Mathukumalli Vidyasagar
Abstract:
With increasing use of digital control it is natural to view control inputs and outputs as stochastic processes assuming values over finite alphabets rather than in a Euclidean space. As control over networks becomes increasingly common, data compression by reducing the size of the input and output alphabets without losing the fidelity of representation becomes relevant. This requires us to define a notion of distance between two stochastic processes assuming values in distinct sets, possibly of different cardinalities. If the two processes are i.i.d., then the problem becomes one of defining a metric between two probability distributions over distinct finite sets of possibly different cardinalities. This is the problem addressed in the present paper. A metric is defined in terms of a joint distribution on the product of the two sets, which has the two given distributions as its marginals, and has minimum entropy. Computing the metric exactly turns out to be NP-hard. Therefore an efficient greedy algorithm is presented for finding an upper bound on the distance. This problem also turns out to be NP-hard, so again a greedy algorithm is constructed for finding a suboptimal reduced order approximation. Taken together, all the results presented here permit the approximation of an i.i.d. process over a set of large cardinality by another i.i.d. process over a set of smaller cardinality. In future work, attempts will be made to extend this work to Markov processes over finite sets.
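The abstract does not spell out its greedy algorithm, so the sketch below uses one common greedy heuristic for building a low-entropy joint distribution with prescribed marginals: repeatedly pair the largest remaining mass on each side. The function names `greedy_coupling` and `entropy` and the tolerance `tol` are assumptions for illustration, and the paper's own procedure may differ.

```python
import math


def greedy_coupling(p, q, tol=1e-12):
    """Build a joint distribution with marginals p and q by repeatedly
    matching the largest remaining masses. Returns {(i, j): mass}."""
    p = list(p)
    q = list(q)
    joint = {}
    while True:
        i = max(range(len(p)), key=p.__getitem__)  # largest remaining p-mass
        j = max(range(len(q)), key=q.__getitem__)  # largest remaining q-mass
        mass = min(p[i], q[j])
        if mass <= tol:
            break  # nothing left to assign on at least one side
        joint[(i, j)] = joint.get((i, j), 0.0) + mass
        p[i] -= mass
        q[j] -= mass
    return joint


def entropy(probs):
    """Shannon entropy in nats of an iterable of probabilities."""
    return -sum(x * math.log(x) for x in probs if x > 0.0)
```

Because any joint distribution with the right marginals is feasible, the entropy of the coupling returned here is an upper bound on the minimum joint entropy; for example, `entropy(greedy_coupling([0.5, 0.5], [0.5, 0.25, 0.25]).values())` evaluates the bound for a two-letter and a three-letter alphabet.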
Submitted 6 September, 2011; v1 submitted 22 April, 2011;
originally announced April 2011.
-
An Improved Bound on the VC-Dimension of Neural Networks with Polynomial Activation Functions
Authors:
J. Maurice Rojas,
M. Vidyasagar
Abstract:
In this note, we derive an improved upper bound for the VC-dimension of neural networks with polynomial activation functions. This improved bound is based on a result of Rojas on the number of connected components of a semi-algebraic set.
Submitted 1 February, 2002; v1 submitted 19 December, 2001;
originally announced December 2001.