
Asymptotics of the Sketched Pseudoinverse

Authors: Daniel LeJeune, Pratik Patil, Hamid Javadi, Richard G. Baraniuk, Ryan J. Tibshirani

Abstract: We take a random matrix theory approach to random sketching and show an asymptotic first-order equivalence of the regularized sketched pseudoinverse of a positive semidefinite matrix to a certain evaluation of the resolvent of the same matrix. We focus on real-valued regularization and extend previous results on an asymptotic equivalence of random matrices to the real setting, providing a precise characterization of the equivalence even under negative regularization, including a precise characterization of the smallest nonzero eigenvalue of the sketched matrix, which may be of independent interest. We then further characterize the second-order equivalence of the sketched pseudoinverse. We also apply our results to the analysis of the sketch-and-project method and to sketched ridge regression. Lastly, we prove that these results generalize to asymptotically free sketching matrices, obtaining the resulting equivalence for orthogonal sketching matrices and comparing our results to several common sketches used in practice.

Submitted 6 October, 2023; v1 submitted 7 November, 2022; originally announced November 2022.

Comments: 45 pages, 9 figures

MSC Class: 15B52; 46L54; 62J07

arXiv:2208.00579 [pdf, other]

Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization

Authors: Tan Nguyen, Richard G. Baraniuk, Robert M. Kirby, Stanley J. Osher, Bao Wang

Abstract: Transformers have achieved remarkable success in sequence modeling and beyond but suffer from quadratic computational and memory complexities with respect to the length of the input sequence. Leveraging techniques including sparse and linear attention and hashing tricks, efficient transformers have been proposed to reduce the quadratic complexity of transformers, but they significantly degrade the accuracy. In response, we first interpret the linear attention and residual connections in computing the attention map as gradient descent steps. We then introduce momentum into these components and propose the momentum transformer, which utilizes momentum to improve the accuracy of linear transformers while maintaining linear memory and computational complexities. Furthermore, we develop an adaptive strategy to compute the momentum value for our model based on the optimal momentum for quadratic optimization. This adaptive momentum eliminates the need to search for the optimal momentum value and further enhances the performance of the momentum transformer. A range of experiments on both autoregressive and non-autoregressive tasks, including image generation and machine translation, demonstrate that the momentum transformer outperforms popular linear transformers in training efficiency and accuracy.

Submitted 31 July, 2022; originally announced August 2022.

Comments: 22 pages, 5 figures. arXiv admin note: substantial text overlap with arXiv:2110.07034

MSC Class: 65Pxx

arXiv:2203.03099 [pdf, other]

doi 10.1007/s00365-022-09601-5

Singular Value Perturbation and Deep Network Optimization

Authors: Rudolf H. Riedi, Randall Balestriero, Richard G. Baraniuk

Abstract: We develop new theoretical results on matrix perturbation to shed light on the impact of architecture on the performance of a deep network. In particular, we explain analytically what deep learning practitioners have long observed empirically: the parameters of some deep architectures (e.g., residual networks, ResNets, and Dense networks, DenseNets) are easier to optimize than others (e.g., convolutional networks, ConvNets). Building on our earlier work connecting deep networks with continuous piecewise-affine splines, we develop an exact local linear representation of a deep network layer for a family of modern deep networks that includes ConvNets at one end of a spectrum and ResNets, DenseNets, and other networks with skip connections at the other. For regression and classification tasks that optimize the squared-error loss, we show that the optimization loss surface of a modern deep network is piecewise quadratic in the parameters, with local shape governed by the singular values of a matrix that is a function of the local linear representation. We develop new perturbation results for how the singular values of matrices of this sort behave as we add a fraction of the identity and multiply by certain diagonal matrices.
A direct application of our perturbation results explains analytically why a network with skip connections (such as a ResNet or DenseNet) is easier to optimize than a ConvNet: thanks to its more stable singular values and smaller condition number, the local loss surface of such a network is less erratic, less eccentric, and features local minima that are more accommodating to gradient-based optimization. Our results also shed new light on the impact of different nonlinear activation functions on a deep network's singular values, regardless of its architecture.

Submitted 5 December, 2022; v1 submitted 6 March, 2022; originally announced March 2022.

Comments: Constr Approx (2022)

arXiv:2006.06919 [pdf, other]

MomentumRNN: Integrating Momentum into Recurrent Neural Networks

Authors: Tan M. Nguyen, Richard G. Baraniuk, Andrea L. Bertozzi, Stanley J. Osher, Bao Wang

Abstract: Designing deep neural networks is an art that often involves an expensive search over candidate architectures. To overcome this for recurrent neural nets (RNNs), we establish a connection between the hidden state dynamics in an RNN and gradient descent (GD). We then integrate momentum into this framework and propose a new family of RNNs, called MomentumRNNs. We theoretically prove and numerically demonstrate that MomentumRNNs alleviate the vanishing gradient issue in training RNNs. We study the momentum long short-term memory (MomentumLSTM) and verify its advantages in convergence speed and accuracy over its LSTM counterpart across a variety of benchmarks. We also demonstrate that MomentumRNN is applicable to many types of recurrent cells, including those in the state-of-the-art orthogonal RNNs. Finally, we show that other advanced momentum-based optimization methods, such as Adam and Nesterov accelerated gradients with a restart, can be easily incorporated into the MomentumRNN framework for designing new recurrent cells with even better performance. The code is available at https://github.com/minhtannguyen/MomentumRNN.

Submitted 11 October, 2020; v1 submitted 11 June, 2020; originally announced June 2020.

Comments: 21 pages, 11 figures, Accepted for publication at Advances in Neural Information Processing Systems (NeurIPS) 2020

MSC Class: 68T07 ACM Class: I.2

Journal ref: Advances in Neural Information Processing Systems (NeurIPS) 2020

arXiv:1511.01017 [pdf, ps, other]

Consistent Parameter Estimation for LASSO and Approximate Message Passing

Authors: Ali Mousavi, Arian Maleki, Richard G. Baraniuk

Abstract: We consider the problem of recovering a vector $β_o \in \mathbb{R}^p$ from $n$ random and noisy linear observations $y = Xβ_o + w$, where $X$ is the measurement matrix and $w$ is noise. The LASSO estimate is given by the solution to the optimization problem $\hat{β}_λ = \arg \min_β \frac{1}{2} \|y-Xβ\|_2^2 + λ\|β\|_1$. Among the iterative algorithms that have been proposed for solving this optimization problem, approximate message passing (AMP) has attracted attention for its fast convergence. Despite significant progress in the theoretical analysis of the estimates of LASSO and AMP, little is known about their behavior as a function of the regularization parameter $λ$, or the threshold parameters $τ^t$. For instance, the following basic questions have not yet been studied in the literature: (i) How does the size of the active set $\|\hat{β}_λ\|_0/p$ behave as a function of $λ$? (ii) How does the mean square error $\|\hat{β}_λ - β_o\|_2^2/p$ behave as a function of $λ$? (iii) How does $\|β^t - β_o\|_2^2/p$ behave as a function of $τ^1, \ldots, τ^{t-1}$? Answering these questions will help in addressing practical challenges regarding the optimal tuning of $λ$ or $τ^1, τ^2, \ldots$. This paper answers these questions in the asymptotic setting and shows how these results can be employed in deriving simple and theoretically optimal approaches for tuning the parameters $τ^1, \ldots, τ^t$ for AMP or $λ$ for LASSO. It also explores the connection between the optimal tuning of the parameters of AMP and the optimal tuning of LASSO.

Submitted 4 November, 2015; v1 submitted 3 November, 2015; originally announced November 2015.

Comments: arXiv admin note: text overlap with arXiv:1309.5979
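
Illustrative note on the LASSO objective quoted in the abstract above: the following is a minimal sketch, not the paper's method, under the simplifying assumption that the measurement matrix $X$ is the identity. In that special case the minimizer of $\frac{1}{2}\|y-β\|_2^2 + λ\|β\|_1$ is obtained by soft-thresholding each entry of $y$ at level $λ$. The function names `soft_threshold`, `lasso_identity_design`, and `lasso_objective` are hypothetical and chosen only for this example.

```python
def soft_threshold(value, lam):
    """Shrink a scalar toward zero by lam; values within [-lam, lam] become 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_identity_design(y, lam):
    """LASSO minimizer under the assumption X = I (identity design).

    With X = I the objective separates across coordinates, so each
    coordinate is solved independently by soft thresholding.
    """
    return [soft_threshold(v, lam) for v in y]


def lasso_objective(beta, y, lam):
    """Evaluate 0.5 * ||y - beta||_2^2 + lam * ||beta||_1 (again with X = I)."""
    squared_error = sum((yi - bi) ** 2 for yi, bi in zip(y, beta))
    l1_norm = sum(abs(bi) for bi in beta)
    return 0.5 * squared_error + lam * l1_norm


if __name__ == "__main__":
    y = [3.0, -0.5, 1.0]
    lam = 1.0
    beta_hat = lasso_identity_design(y, lam)
    print("estimate:", beta_hat)
    print("objective at estimate:", lasso_objective(beta_hat, y, lam))
    print("objective at y:", lasso_objective(y, y, lam))
```

The fraction of nonzero entries in the estimate shrinks as `lam` grows, which is a toy version of the active-set question (i) raised in the abstract; for a general $X$ an iterative solver would be needed.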

arXiv:1406.4175 [pdf, other]

From Denoising to Compressed Sensing

Authors: Christopher A. Metzler, Arian Maleki, Richard G. Baraniuk

Abstract: A denoising algorithm seeks to remove noise, errors, or perturbations from a signal. Extensive research has been devoted to this arena over the last several decades, and as a result, today's denoisers can effectively remove large amounts of additive white Gaussian noise. A compressed sensing (CS) reconstruction algorithm seeks to recover a structured signal acquired using a small number of randomized measurements. Typical CS reconstruction algorithms can be cast as iteratively estimating a signal from a perturbed observation. This paper answers a natural question: How can one effectively employ a generic denoiser in a CS reconstruction algorithm? In response, we develop an extension of the approximate message passing (AMP) framework, called Denoising-based AMP (D-AMP), that can integrate a wide class of denoisers within its iterations. We demonstrate that, when used with a high performance denoiser for natural images, D-AMP offers state-of-the-art CS recovery performance while operating tens of times faster than competing methods. We explain the exceptional performance of D-AMP by analyzing some of its theoretical features. A key element in D-AMP is the use of an appropriate Onsager correction term in its iterations, which coerces the signal perturbation at each iteration to be very close to the white Gaussian noise that denoisers are typically designed to remove.

Submitted 17 April, 2016; v1 submitted 16 June, 2014; originally announced June 2014.

arXiv:1404.4104 [pdf, other]

Sparse Bilinear Logistic Regression

Authors: Jianing V. Shi, Yangyang Xu, Richard G. Baraniuk

Abstract: In this paper, we introduce the concept of sparse bilinear logistic regression for decision problems involving explanatory variables that are two-dimensional matrices. Such problems are common in computer vision, brain-computer interfaces, style/content factorization, and parallel factor analysis. The underlying optimization problem is bi-convex; we study its solution and develop an efficient algorithm based on block coordinate descent. We provide a theoretical guarantee for global convergence and estimate the asymptotic convergence rate using the Kurdyka-Łojasiewicz inequality. A range of experiments with simulated and real data demonstrate that sparse bilinear logistic regression outperforms current techniques in several important applications.

Submitted 15 April, 2014; originally announced April 2014.

Comments: 27 pages, 5 figures

MSC Class: 65K10; 68W40; 68Q32

arXiv:1404.3418 [pdf, ps, other]

Active Learning for Undirected Graphical Model Selection

Authors: Divyanshu Vats, Robert D. Nowak, Richard G. Baraniuk

Abstract: This paper studies graphical model selection, i.e., the problem of estimating a graph of statistical relationships among a collection of random variables. Conventional graphical model selection algorithms are passive, i.e., they require all the measurements to have been collected before processing begins. We propose an active learning algorithm that uses junction tree representations to adapt future measurements based on the information gathered from prior measurements. We prove that, under certain conditions, our active learning algorithm requires fewer scalar measurements than any passive algorithm to reliably estimate a graph. A range of numerical results validate our theory and demonstrate the benefits of active learning.

Submitted 13 April, 2014; originally announced April 2014.

Comments: AISTATS 2014

Journal ref: Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS) 2014, Reykjavik, Iceland. JMLR: W&CP volume 33

arXiv:1402.5584 [pdf, ps, other]

Path Thresholding: Asymptotically Tuning-Free High-Dimensional Sparse Regression

Authors: Divyanshu Vats, Richard G. Baraniuk

Abstract: In this paper, we address the challenging problem of selecting tuning parameters for high-dimensional sparse regression. We propose a simple and computationally efficient method, called path thresholding (PaTh), that transforms any tuning parameter-dependent sparse regression algorithm into an asymptotically tuning-free sparse regression algorithm. More specifically, we prove that, as the problem size becomes large (in the number of variables and in the number of observations), PaTh performs accurate sparse regression, under appropriate conditions, without specifying a tuning parameter. In finite-dimensional settings, we demonstrate that PaTh can alleviate the computational burden of model selection algorithms by significantly reducing the search space of tuning parameters.

Submitted 23 February, 2014; originally announced February 2014.

Comments: AISTATS 2014

Journal ref: Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS) 2014, Reykjavik, Iceland. JMLR: W&CP volume 33

arXiv:1401.7715 [pdf, other]

Video Compressive Sensing for Dynamic MRI

Authors: Jianing V. Shi, Wotao Yin, Aswin C. Sankaranarayanan, Richard G. Baraniuk

Abstract: We present a video compressive sensing framework, termed kt-CSLDS, to accelerate the image acquisition process of dynamic magnetic resonance imaging (MRI). We are inspired by a state-of-the-art model for video compressive sensing that utilizes a linear dynamical system (LDS) to model the motion manifold. Given compressive measurements, the state sequence of an LDS can be first estimated using system identification techniques. We then reconstruct the observation matrix using a joint structured sparsity assumption. In particular, we minimize an objective function with a mixture of wavelet sparsity and joint sparsity within the observation matrix. We derive an efficient convex optimization algorithm using the alternating direction method of multipliers (ADMM), and provide a theoretical guarantee for global convergence. We demonstrate the performance of our approach for video compressive sensing in terms of reconstruction accuracy. We also investigate the impact of various sampling strategies. We apply this framework to accelerate the acquisition process of dynamic MRI and show that it achieves the best reconstruction accuracy with the least computational time compared with existing algorithms in the literature.

Submitted 1 February, 2014; v1 submitted 29 January, 2014; originally announced January 2014.

Comments: 30 pages, 9 figures

MSC Class: 90-08; 90C25; 65P99; 65K10; 93E10; 93E12

arXiv:1312.5734 [pdf, ps, other]

Time-varying Learning and Content Analytics via Sparse Factor Analysis

Authors: Andrew S. Lan, Christoph Studer, Richard G. Baraniuk

Abstract: We propose SPARFA-Trace, a new machine learning-based framework for time-varying learning and content analytics for education applications. We develop a novel message passing-based, blind, approximate Kalman filter for sparse factor analysis (SPARFA) that jointly (i) traces learner concept knowledge over time, (ii) analyzes learner concept knowledge state transitions (induced by interacting with learning resources, such as textbook sections, lecture videos, etc., or the forgetting effect), and (iii) estimates the content organization and intrinsic difficulty of the assessment questions. These quantities are estimated solely from binary-valued (correct/incorrect) graded learner response data and a summary of the specific actions each learner performs (e.g., answering a question or studying a learning resource) at each time instance. Experimental results on two online course datasets demonstrate that SPARFA-Trace is capable of tracing each learner's concept knowledge evolution over time, as well as analyzing the quality and content organization of learning resources, the question-concept associations, and the question intrinsic difficulties. Moreover, we show that SPARFA-Trace achieves comparable or better performance in predicting unobserved learner responses than existing collaborative filtering and knowledge tracing approaches for personalized education.

Submitted 19 December, 2013; originally announced December 2013.

arXiv:1312.1706 [pdf, ps, other]

Swapping Variables for High-Dimensional Sparse Regression with Correlated Measurements

Authors: Divyanshu Vats, Richard G. Baraniuk

Abstract: We consider the high-dimensional sparse linear regression problem of accurately estimating a sparse vector using a small number of linear measurements that are contaminated by noise. It is well known that the standard cadre of computationally tractable sparse regression algorithms, such as the Lasso, Orthogonal Matching Pursuit (OMP), and their extensions, perform poorly when the measurement matrix contains highly correlated columns. To address this shortcoming, we develop a simple greedy algorithm, called SWAP, that iteratively swaps variables until convergence. SWAP is surprisingly effective in handling measurement matrices with high correlations. In fact, we prove that SWAP outputs the true support, the locations of the non-zero entries in the sparse vector, under a relatively mild condition on the measurement matrix. Furthermore, we show that SWAP can be used to boost the performance of any sparse regression algorithm. We empirically demonstrate the advantages of SWAP by comparing it with several state-of-the-art sparse regression algorithms.

Submitted 22 February, 2014; v1 submitted 5 December, 2013; originally announced December 2013.

Comments: Parts of this paper have appeared in NIPS 2013

arXiv:1311.0035 [pdf, ps, other]

Parameterless Optimal Approximate Message Passing

Authors: Ali Mousavi, Arian Maleki, Richard G. Baraniuk

Abstract: Iterative thresholding algorithms are well-suited for high-dimensional problems in sparse recovery and compressive sensing. The performance of this class of algorithms depends heavily on the tuning of certain threshold parameters. In particular, both the final reconstruction error and the convergence rate of the algorithm crucially rely on how the threshold parameter is set at each step of the algorithm. In this paper, we propose a parameter-free approximate message passing (AMP) algorithm that sets the threshold parameter at each iteration in a fully automatic way, without requiring either any information about the signal to be reconstructed or any tuning from the user. We show that the proposed method attains both the minimum reconstruction error and the highest convergence rate. Our method is based on applying the Stein unbiased risk estimate (SURE) along with a modified gradient descent to find the optimal threshold in each iteration. Motivated by the connections between AMP and LASSO, it could be employed to find the solution of the LASSO for the optimal regularization parameter. To the best of our knowledge, this is the first work concerning parameter tuning that obtains the fastest convergence rate with theoretical guarantees.

Submitted 31 October, 2013; originally announced November 2013.

arXiv:1309.5979 [pdf, other]

Asymptotic Analysis of LASSOs Solution Path with Implications for Approximate Message Passing

Authors: Ali Mousavi, Arian Maleki, Richard G. Baraniuk

Abstract: This paper concerns the performance of the LASSO (also known as basis pursuit denoising) for recovering sparse signals from undersampled, randomized, noisy measurements. We consider the recovery of the signal $x_o \in \mathbb{R}^N$ from $n$ random and noisy linear observations $y = Ax_o + w$, where $A$ is the measurement matrix and $w$ is the noise. The LASSO estimate is given by the solution to the optimization problem $\hat{x}_λ = \arg \min_x \frac{1}{2} \|y-Ax\|_2^2 + λ\|x\|_1$. Despite major progress in the theoretical analysis of the LASSO solution, little is known about its behavior as a function of the regularization parameter $λ$. In this paper we study two questions in the asymptotic setting (i.e., where $N \rightarrow \infty$, $n \rightarrow \infty$ while the ratio $n/N$ converges to a fixed number in $(0,1)$): (i) How does the size of the active set $\|\hat{x}_λ\|_0/N$ behave as a function of $λ$, and (ii) How does the mean square error $\|\hat{x}_λ - x_o\|_2^2/N$ behave as a function of $λ$? We then employ these results in a new, reliable algorithm for solving LASSO based on approximate message passing (AMP).

Submitted 23 September, 2013; originally announced September 2013.

arXiv:1303.5685 [pdf, ps, other]

Sparse Factor Analysis for Learning and Content Analytics

Authors: Andrew S. Lan, Andrew E. Waters, Christoph Studer, Richard G. Baraniuk

Abstract: We develop a new model and algorithms for machine learning-based learning analytics, which estimate a learner's knowledge of the concepts underlying a domain, and content analytics, which estimate the relationships among a collection of questions and those concepts. Our model represents the probability that a learner provides the correct response to a question in terms of three factors: their understanding of a set of underlying concepts, the concepts involved in each question, and each question's intrinsic difficulty. We estimate these factors given the graded responses to a collection of questions. The underlying estimation problem is ill-posed in general, especially when only a subset of the questions are answered. The key observation that enables a well-posed solution is the fact that typical educational domains of interest involve only a small number of key concepts. Leveraging this observation, we develop both a bi-convex maximum-likelihood and a Bayesian solution to the resulting SPARse Factor Analysis (SPARFA) problem. We also incorporate user-defined tags on questions to facilitate the interpretability of the estimated factors. Experiments with synthetic and real-world data demonstrate the efficacy of our approach. Finally, we make a connection between SPARFA and noisy, binary-valued (1-bit) dictionary learning that is of independent interest.

Submitted 19 July, 2013; v1 submitted 22 March, 2013; originally announced March 2013.

Journal ref: Journal of Machine Learning Research, vol. 15, pp. 1959-2008, June, 2014

arXiv:1303.4778 [pdf, other]

Greedy Feature Selection for Subspace Clustering

Authors: Eva L. Dyer, Aswin C. Sankaranarayanan, Richard G. Baraniuk

Abstract: Unions of subspaces provide a powerful generalization to linear subspace models for collections of high-dimensional data. To learn a union of subspaces from a collection of data, sets of signals in the collection that belong to the same subspace must be identified in order to obtain accurate estimates of the subspace structures present in the data. Recently, sparse recovery methods have been shown to provide a provable and robust strategy for exact feature selection (EFS), that is, recovering subsets of points from the ensemble that live in the same subspace. In parallel with recent studies of EFS with L1-minimization, in this paper, we develop sufficient conditions for EFS with a greedy method for sparse signal recovery known as orthogonal matching pursuit (OMP). Following our analysis, we provide an empirical study of feature selection strategies for signals living on unions of subspaces and characterize the gap between sparse recovery methods and nearest neighbor (NN)-based approaches. In particular, we demonstrate that sparse recovery methods provide significant advantages over NN methods and that the gap between the two approaches is particularly pronounced when the sampling of subspaces in the dataset is sparse. Our results suggest that OMP may be employed to reliably recover exact feature sets in a number of regimes where NN approaches fail to reveal the subspace membership of points in the ensemble.

Submitted 3 July, 2013; v1 submitted 19 March, 2013; originally announced March 2013.

Comments: 32 pages, 7 figures, 1 table

Journal ref: Journal of Machine Learning Research, Vol.14, Issue 1, pp. 2487-2517, January 2013

arXiv:1112.0311 [pdf, other]

Anisotropic Nonlocal Means Denoising

Authors: Arian Maleki, Manjari Narayan, Richard G. Baraniuk

Abstract: It has recently been proved that the popular nonlocal means (NLM) denoising algorithm does not optimally denoise images with sharp edges. Its weakness lies in the isotropic nature of the neighborhoods it uses to set its smoothing weights. In response, in this paper we introduce several theoretical and practical anisotropic nonlocal means (ANLM) algorithms and prove that they are near minimax optimal for edge-dominated images from the Horizon class. On real-world test images, an ANLM algorithm that adapts to the underlying image gradients outperforms NLM by a significant margin.

Submitted 30 November, 2012; v1 submitted 30 November, 2011; originally announced December 2011.

Comments: Accepted for publication in Applied and Computational Harmonic Analysis (ACHA)

arXiv:1111.5867 [pdf, other]

Suboptimality of Nonlocal Means for Images with Sharp Edges

Authors: Arian Maleki, Manjari Narayan, Richard G. Baraniuk

Abstract: We conduct an asymptotic risk analysis of the nonlocal means image denoising algorithm for the Horizon class of images that are piecewise constant with a sharp edge discontinuity. We prove that the mean square risk of an optimally tuned nonlocal means algorithm decays according to $n^{-1}\log^{1/2+ε} n$, for an $n$-pixel image with $ε>0$. This decay rate is an improvement over some of the predecessors of this algorithm, including the linear convolution filter, median filter, and the SUSAN filter, each of which provides a rate of only $n^{-2/3}$. It is also within a logarithmic factor from optimally tuned wavelet thresholding. However, it is still substantially lower than the optimal minimax rate of $n^{-4/3}$.

Submitted 24 November, 2011; originally announced November 2011.

Comments: 33 pages, 3 figures

arXiv:0911.0736 [pdf, ps, other]

A simple proof that random matrices are democratic

Authors: Mark A. Davenport, Jason N. Laska, Petros T. Boufounos, Richard G. Baraniuk

Abstract: The recently introduced theory of compressive sensing (CS) enables the reconstruction of sparse or compressible signals from a small set of nonadaptive, linear measurements. If properly chosen, the number of measurements can be significantly smaller than the ambient dimension of the signal and yet preserve the significant signal information. Interestingly, it can be shown that random measurement schemes provide a near-optimal encoding in terms of the required number of measurements. In this report, we explore another relatively unexplored, though often alluded to, advantage of using random matrices to acquire CS measurements. Specifically, we show that random matrices are democratic, meaning that each measurement carries roughly the same amount of signal information. We demonstrate that by slightly increasing the number of measurements, the system is robust to the loss of a small number of arbitrary measurements. In addition, we draw connections to oversampling and demonstrate stability from the loss of significantly more measurements.

Submitted 4 November, 2009; originally announced November 2009.

Report number: Rice University Department of Electrical and Computer Engineering Technical Report TREE0906

MSC Class: 41A46; 68W20; 90C27

arXiv:math/0611191 [pdf, ps, other]

doi 10.1214/074921706000000509

Optimal sampling strategies for multiscale stochastic processes

Authors: Vinay J. Ribeiro, Rudolf H. Riedi, Richard G. Baraniuk

Abstract: In this paper, we determine which non-random sampling of fixed size gives the best linear predictor of the sum of a finite spatial population. We employ different multiscale superpopulation models and use the minimum mean-squared error as our optimality criterion. In multiscale superpopulation tree models, the leaves represent the units of the population, interior nodes represent partial sums of the population, and the root node represents the total sum of the population. We prove that the optimal sampling pattern varies dramatically with the correlation structure of the tree nodes. While uniform sampling is optimal for trees with ``positive correlation progression'', it provides the worst possible sampling with ``negative correlation progression.'' As an analysis tool, we introduce and study a class of independent innovations trees that are of interest in their own right. We derive a fast water-filling algorithm to determine the optimal sampling of the leaves to estimate the root of an independent innovations tree.

Submitted 7 November, 2006; originally announced November 2006.

Comments: Published at http://dx.doi.org/10.1214/074921706000000509 in the IMS Lecture Notes--Monograph Series (http://www.imstat.org/publications/lecnotes.htm) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-LNMS49-LNMS4916

MSC Class: 94A20; 62M30; 60G18 (Primary) 62H11; 62H12; 78M50 (Secondary)

Journal ref: IMS Lecture Notes--Monograph Series 2006, Vol. 49, 266-290

Showing 1–20 of 20 results for author: Baraniuk, R G