Search | arXiv e-print repository

arXiv:2006.01662 [pdf, other]

Tree-Projected Gradient Descent for Estimating Gradient-Sparse Parameters on Graphs

Authors: Sheng Xu, Zhou Fan, Sahand Negahban

Abstract: We study estimation of a gradient-sparse parameter vector $\boldsymbol{\theta}^* \in \mathbb{R}^p$, having strong gradient-sparsity $s^*:=\|\nabla_G \boldsymbol{\theta}^*\|_0$ on an underlying graph $G$. Given observations $Z_1,\ldots,Z_n$ and a smooth, convex loss function $\mathcal{L}$ for which $\boldsymbol{\theta}^*$ minimizes the population risk $\mathbb{E}[\mathcal{L}(\boldsymbol{\theta};Z_1,\ldots,Z_n)]$, we propose to estimate $\boldsymbol{\theta}^*$ by a projected gradient descent algorithm that iteratively and approximately projects gradient steps onto spaces of vectors having small gradient-sparsity over low-degree spanning trees of $G$. We show that, under suitable restricted strong convexity and smoothness assumptions for the loss, the resulting estimator achieves the squared-error risk $\frac{s^*}{n} \log (1+\frac{p}{s^*})$ up to a multiplicative constant that is independent of $G$. In contrast, previous polynomial-time algorithms have only been shown to achieve this guarantee in more specialized settings, or under additional assumptions for $G$ and/or the sparsity pattern of $\nabla_G \boldsymbol{\theta}^*$. As applications of our general framework, we apply our results to the examples of linear models and generalized linear models with random design.

Submitted 31 May, 2020; originally announced June 2020.

arXiv:1912.01417 [pdf, other]

Distributed Machine Learning with Sparse Heterogeneous Data

Authors: Dominic Richards, Sahand N. Negahban, Patrick Rebeschini

Abstract: Motivated by distributed machine learning settings such as Federated Learning, we consider the problem of fitting a statistical model across a distributed collection of heterogeneous data sets whose similarity structure is encoded by a graph topology. Precisely, we analyse the case where each node is associated with fitting a sparse linear model, and edges join two nodes if the difference of their solutions is also sparse. We propose a method based on Basis Pursuit Denoising with a total variation penalty, and provide finite sample guarantees for sub-Gaussian design matrices. Taking the root of the tree as a reference node, we show that if the sparsity of the differences across nodes is smaller than the sparsity at the root, then recovery is successful with fewer samples than by solving the problems independently, or by using methods that rely on a large overlap in the signal supports, such as the group Lasso. We consider both the noiseless and noisy setting, and numerically investigate the performance of distributed methods based on Distributed Alternating Direction Methods of Multipliers (ADMM) and hyperspectral unmixing.

Submitted 27 November, 2021; v1 submitted 3 December, 2019; originally announced December 2019.

Comments: NeurIPS 2021 camera ready

arXiv:1901.00301 [pdf, other]

Warm-starting Contextual Bandits: Robustly Combining Supervised and Bandit Feedback

Authors: Chicheng Zhang, Alekh Agarwal, Hal Daumé III, John Langford, Sahand N Negahban

Abstract: We investigate the feasibility of learning from a mix of both fully-labeled supervised data and contextual bandit data. We specifically consider settings in which the underlying learning signal may be different between these two data sources. Theoretically, we state and prove no-regret algorithms for learning that is robust to misaligned cost distributions between the two sources. Empirically, we evaluate some of these algorithms on a large selection of datasets, showing that our approach is both feasible and helpful in practice.

Submitted 21 June, 2019; v1 submitted 2 January, 2019; originally announced January 2019.

Comments: 42 pages, 21 figures, ICML 2019

arXiv:1810.09401 [pdf, other]

Alternating Linear Bandits for Online Matrix-Factorization Recommendation

Authors: Hamid Dadkhahi, Sahand Negahban

Abstract: We consider the problem of online collaborative filtering in the online setting, where items are recommended to the users over time. At each time step, the user (selected by the environment) consumes an item (selected by the agent) and provides a rating of the selected item. In this paper, we propose a novel algorithm for online matrix factorization recommendation that combines linear bandits and alternating least squares. In this formulation, the bandit feedback is equal to the difference between the ratings of the best and selected items. We evaluate the performance of the proposed algorithm over time using both cumulative regret and average cumulative NDCG. Simulation results over three synthetic datasets as well as three real-world datasets for online collaborative filtering indicate the superior performance of the proposed algorithm over two state-of-the-art online algorithms.

Submitted 22 October, 2018; originally announced October 2018.

arXiv:1810.04247 [pdf, other]

Feature Selection using Stochastic Gates

Authors: Yutaro Yamada, Ofir Lindenbaum, Sahand Negahban, Yuval Kluger

Abstract: Feature selection problems have been extensively studied for linear estimation, for instance, Lasso, but less emphasis has been placed on feature selection for non-linear functions. In this study, we propose a method for feature selection in high-dimensional non-linear function estimation problems. The new procedure is based on minimizing the $\ell_0$ norm of the vector of indicator variables that represent if a feature is selected or not. Our approach relies on the continuous relaxation of Bernoulli distributions, which allows our model to learn the parameters of the approximate Bernoulli distributions via gradient descent. This general framework simultaneously minimizes a loss function while selecting relevant features. Furthermore, we provide an information-theoretic justification of incorporating Bernoulli distribution into our approach and demonstrate the potential of the approach on synthetic and real-life applications.

Submitted 26 July, 2020; v1 submitted 9 October, 2018; originally announced October 2018.

Comments: Published in ICML 2020

Journal ref: Proceedings of Machine Learning and Systems 2020, pages 8952--8963

arXiv:1710.07006 [pdf, ps, other]

Minimax Estimation of Bandable Precision Matrices

Authors: Addison Hu, Sahand Negahban

Abstract: The inverse covariance matrix provides considerable insight for understanding statistical models in the multivariate setting. In particular, when the distribution over variables is assumed to be multivariate normal, the sparsity pattern in the inverse covariance matrix, commonly referred to as the precision matrix, corresponds to the adjacency matrix representation of the Gauss-Markov graph, which encodes conditional independence statements between variables. Minimax results under the spectral norm have previously been established for covariance matrices, both sparse and banded, and for sparse precision matrices. We establish minimax estimation bounds for estimating banded precision matrices under the spectral norm. Our results greatly improve upon the existing bounds; in particular, we find that the minimax rate for estimating banded precision matrices matches that of estimating banded covariance matrices. The key insight in our analysis is that we are able to obtain barely-noisy estimates of $k \times k$ subblocks of the precision matrix by inverting slightly wider blocks of the empirical covariance matrix along the diagonal. Our theoretical results are complemented by experiments demonstrating the sharpness of our bounds.

Submitted 19 October, 2017; originally announced October 2017.

arXiv:1704.07228 [pdf, other]

Learning from Comparisons and Choices

Authors: Sahand Negahban, Sewoong Oh, Kiran K. Thekumparampil, Jiaming Xu

Abstract: When tracking user-specific online activities, each user's preference is revealed in the form of choices and comparisons. For example, a user's purchase history is a record of her choices, i.e. which item was chosen among a subset of offerings. A user's preferences can be observed either explicitly as in movie ratings or implicitly as in viewing times of news articles. Given such individualized ordinal data in the form of comparisons and choices, we address the problem of collaboratively learning representations of the users and the items. The learned features can be used to predict a user's preference of an unseen item to be used in recommendation systems. This also allows one to compute similarities among users and items to be used for categorization and search. Motivated by the empirical successes of the MultiNomial Logit (MNL) model in marketing and transportation, and also more recent successes in word embedding and crowdsourced image embedding, we pose this problem as learning the MNL model parameters that best explain the data. We propose a convex relaxation for learning the MNL model, and show that it is minimax optimal up to a logarithmic factor by comparing its performance to a fundamental lower bound. This characterizes the minimax sample complexity of the problem, and proves that the proposed estimator cannot be improved upon other than by a logarithmic factor. Further, the analysis identifies how the accuracy depends on the topology of sampling via the spectrum of the sampling graph. This provides a guideline for designing surveys when one can choose which items are to be compared. This is accompanied by numerical simulations on synthetic and real data sets, confirming our theoretical predictions.

Submitted 30 December, 2018; v1 submitted 24 April, 2017; originally announced April 2017.

Comments: 77 pages, 12 figures; added new experiments and references. arXiv admin note: substantial text overlap with arXiv:1506.07947

arXiv:1703.02723 [pdf, other]

Scalable Greedy Feature Selection via Weak Submodularity

Authors: Rajiv Khanna, Ethan Elenberg, Alexandros G. Dimakis, Sahand Negahban, Joydeep Ghosh

Abstract: Greedy algorithms are widely used for problems in machine learning such as feature selection and set function optimization. Unfortunately, for large datasets, the running time of even greedy algorithms can be quite high. This is because for each greedy step we need to refit a model or calculate a function using the previously selected choices and the new candidate. Two algorithms that are faster approximations to the greedy forward selection were introduced recently ([Mirzasoleiman et al. 2013, 2015]). They achieve better performance by exploiting distributed computation and stochastic evaluation respectively. Both algorithms have provable performance guarantees for submodular functions. In this paper we show that divergent from previously held opinion, submodularity is not required to obtain approximation guarantees for these two algorithms. Specifically, we show that a generalized concept of weak submodularity suffices to give multiplicative approximation guarantees. Our result extends the applicability of these algorithms to a larger class of functions. Furthermore, we show that a bounded submodularity ratio can be used to provide data dependent bounds that can sometimes be tighter also for submodular functions. We empirically validate our work by showing superior performance of fast greedy approximations versus several established baselines on artificial and real datasets.

Submitted 8 March, 2017; originally announced March 2017.

Comments: To appear in AISTATS 2017

arXiv:1703.02721 [pdf, other]

On Approximation Guarantees for Greedy Low Rank Optimization

Authors: Rajiv Khanna, Ethan Elenberg, Alexandros G. Dimakis, Sahand Negahban

Abstract: We provide new approximation guarantees for greedy low rank matrix estimation under standard assumptions of restricted strong convexity and smoothness. Our novel analysis also uncovers previously unknown connections between the low rank estimation and combinatorial optimization, so much so that our bounds are reminiscent of corresponding approximation bounds in submodular maximization. Additionally, we also provide statistical recovery guarantees. Finally, we present empirical comparison of greedy estimation with established baselines on two important real-world problems.

Submitted 8 March, 2017; originally announced March 2017.

arXiv:1612.00804 [pdf, other]

Restricted Strong Convexity Implies Weak Submodularity

Authors: Ethan R. Elenberg, Rajiv Khanna, Alexandros G. Dimakis, Sahand Negahban

Abstract: We connect high-dimensional subset selection and submodular maximization. Our results extend the work of Das and Kempe (2011) from the setting of linear regression to arbitrary objective functions. For greedy feature selection, this connection allows us to obtain strong multiplicative performance bounds on several methods without statistical modeling assumptions. We also derive recovery guarantees of this form under standard assumptions. Our work shows that greedy algorithms perform within a constant factor from the best possible subset-selection solution for a broad class of general objective functions. Our methods allow a direct control over the number of obtained features as opposed to regularization parameters that only implicitly control sparsity. Our proof technique uses the concept of weak submodularity initially defined by Das and Kempe. We draw a connection between convex analysis and submodular set function theory which may be of independent interest for other statistical learning applications that have combinatorial structure.

Submitted 12 October, 2017; v1 submitted 2 December, 2016; originally announced December 2016.
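The greedy feature selection discussed in the abstract above can be illustrated with a minimal sketch. It assumes the linear-regression setting of Das and Kempe, with the uncentered $R^2$ of a least-squares fit as the set function; the names `r_squared` and `greedy_forward_selection` are hypothetical and are not taken from the paper.

```python
import numpy as np

def r_squared(X, y, subset):
    """Uncentered R^2 of the least-squares fit of y on the columns in `subset`."""
    if not subset:
        return 0.0
    Xs = X[:, subset]
    coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ coef
    return 1.0 - (resid @ resid) / (y @ y)

def greedy_forward_selection(X, y, k):
    """Add, k times, the feature that most increases R^2 given those already chosen."""
    selected = []
    for _ in range(k):
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        best = max(candidates, key=lambda j: r_squared(X, y, selected + [j]))
        selected.append(best)
    return selected

# Usage: an orthonormal design in which only features 2 and 5 carry signal.
X = np.eye(8)
y = 3.0 * X[:, 2] - 2.0 * X[:, 5]
print(greedy_forward_selection(X, y, k=2))  # prints [2, 5]
```

The parameter `k` fixes the number of selected features directly, which is the contrast the abstract draws with regularization parameters that control sparsity only implicitly.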

arXiv:1610.09600 [pdf, other]

Super-resolution estimation of cyclic arrival rates

Authors: Ningyuan Chen, Donald K. K. Lee, Sahand Negahban

Abstract: Exploiting the fact that most arrival processes exhibit cyclic behaviour, we propose a simple procedure for estimating the intensity of a nonhomogeneous Poisson process. The estimator is the super-resolution analogue to Shao 2010 and Shao & Lii 2011, which is a sum of $p$ sinusoids where $p$ and the frequency, amplitude, and phase of each wave are not known and need to be estimated. This results in an interpretable yet flexible specification that is suitable for use in modelling as well as in high resolution simulations. Our estimation procedure sits in between classic periodogram methods and atomic/total variation norm thresholding. Through a novel use of window functions in the point process domain, our approach attains super-resolution without semidefinite programming. Under suitable conditions, finite sample guarantees can be derived for our procedure. These resolve some open questions and expand existing results in spectral estimation literature.

Submitted 27 February, 2019; v1 submitted 29 October, 2016; originally announced October 2016.

Comments: 32 pages, 5 figures

MSC Class: 62M15; 90B22; 60G55

Journal ref: Annals of Statistics 47:3:1754-1775 (2019)

arXiv:1511.05432 [pdf, other]

doi 10.1016/j.neucom.2018.04.027

Understanding Adversarial Training: Increasing Local Stability of Neural Nets through Robust Optimization

Authors: Uri Shaham, Yutaro Yamada, Sahand Negahban

Abstract: We propose a general framework for increasing local stability of Artificial Neural Nets (ANNs) using Robust Optimization (RO). We achieve this through an alternating minimization-maximization procedure, in which the loss of the network is minimized over perturbed examples that are generated at each parameter update. We show that adversarial training of ANNs is in fact robustification of the network optimization, and that our proposed framework generalizes previous approaches for increasing local stability of ANNs. Experimental results reveal that our approach increases the robustness of the network to existing adversarial examples, while making it harder to generate new ones. Furthermore, our algorithm improves the accuracy of the network also on the original test data.

Submitted 16 January, 2016; v1 submitted 17 November, 2015; originally announced November 2015.

arXiv:1410.0860 [pdf, ps, other]

Individualized Rank Aggregation using Nuclear Norm Regularization

Authors: Yu Lu, Sahand N. Negahban

Abstract: In recent years rank aggregation has received significant attention from the machine learning community. The goal of such a problem is to combine the (partially revealed) preferences over objects of a large population into a single, relatively consistent ordering of those objects. However, in many cases, we might not want a single ranking and instead opt for individual rankings. We study a version of the problem known as collaborative ranking. In this problem we assume that individual users provide us with pairwise preferences (for example purchasing one item over another). From those preferences we wish to obtain rankings on items that the users have not had an opportunity to explore. The results here have a very interesting connection to the standard matrix completion problem. We provide a theoretical justification for a nuclear norm regularized optimization procedure, and provide high-dimensional scaling results that show how the error in estimating user preferences behaves as the number of observations increases.

Submitted 3 October, 2014; originally announced October 2014.

arXiv:1209.3775 [pdf, other]

doi 10.1093/mnras/stt1306

Using Machine Learning for Discovery in Synoptic Survey Imaging

Authors: Henrik Brink, Joseph W. Richards, Dovi Poznanski, Joshua S. Bloom, John Rice, Sahand Negahban, Martin Wainwright

Abstract: Modern time-domain surveys continuously monitor large swaths of the sky to look for astronomical variability. Astrophysical discovery in such data sets is complicated by the fact that detections of real transient and variable sources are highly outnumbered by bogus detections caused by imperfect subtractions, atmospheric effects and detector artefacts. In this work we present a machine learning (ML) framework for discovery of variability in time-domain imaging surveys. Our ML methods provide probabilistic statements, in near real time, about the degree to which each newly observed source is an astrophysically relevant source of variable brightness. We provide details about each of the analysis steps involved, including compilation of the training and testing sets, construction of descriptive image-based and contextual features, and optimization of the feature subset and model tuning parameters. Using a validation set of nearly 30,000 objects from the Palomar Transient Factory, we demonstrate a missed detection rate of at most 7.7% at our chosen false-positive rate of 1% for an optimized ML classifier of 23 features, selected to avoid feature correlation and over-fitting from an initial library of 42 attributes. Importantly, we show that our classification methodology is insensitive to mis-labelled training data up to a contamination of nearly 10%, making it easier to compile sufficient training sets for accurate performance in future surveys. This ML framework, if so adopted, should enable the maximization of scientific gain from future synoptic surveys and enable fast follow-up decisions on the vast amounts of streaming data produced by such experiments.

Submitted 17 September, 2012; originally announced September 2012.

Comments: 16 pages, 14 figures

arXiv:1209.1688 [pdf, other]

Rank Centrality: Ranking from Pair-wise Comparisons

Authors: Sahand Negahban, Sewoong Oh, Devavrat Shah

Abstract: The question of aggregating pair-wise comparisons to obtain a global ranking over a collection of objects has been of interest for a very long time: be it ranking of online gamers (e.g. MSR's TrueSkill system) and chess players, aggregating social opinions, or deciding which product to sell based on transactions. In most settings, in addition to obtaining a ranking, finding `scores' for each object (e.g. player's rating) is of interest for understanding the intensity of the preferences. In this paper, we propose Rank Centrality, an iterative rank aggregation algorithm for discovering scores for objects (or items) from pair-wise comparisons. The algorithm has a natural random walk interpretation over the graph of objects with an edge present between a pair of objects if they are compared; the score, which we call Rank Centrality, of an object turns out to be its stationary probability under this random walk. To study the efficacy of the algorithm, we consider the popular Bradley-Terry-Luce (BTL) model (equivalent to the Multinomial Logit (MNL) for pair-wise comparisons) in which each object has an associated score which determines the probabilistic outcomes of pair-wise comparisons between objects. In terms of the pair-wise marginal probabilities, which is the main subject of this paper, the MNL model and the BTL model are identical. We bound the finite sample error rates between the scores assumed by the BTL model and those estimated by our algorithm. In particular, the number of samples required to learn the score well with high probability depends on the structure of the comparison graph. When the Laplacian of the comparison graph has a strictly positive spectral gap, e.g. each item is compared to a subset of randomly chosen items, this leads to dependence on the number of samples that is nearly order-optimal.

Submitted 12 November, 2015; v1 submitted 8 September, 2012; originally announced September 2012.

Comments: 45 pages, 3 figures
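The random-walk interpretation in the Rank Centrality abstract above can be sketched in a few lines. This is an illustrative reading of the abstract rather than the authors' reference implementation: the transition rule (move from item i to item j with probability proportional to the fraction of comparisons that j won against i, scaled by the maximum degree of the comparison graph), the `wins` matrix layout, and the use of plain power iteration are all assumptions made here.

```python
import numpy as np

def rank_centrality(wins, n_iter=1000):
    """Scores from pairwise comparisons via a random walk on the comparison graph.

    wins[i, j] is the number of comparisons in which item i beat item j.
    Returns the stationary distribution of the walk (higher means stronger).
    """
    wins = np.asarray(wins, dtype=float)
    n = wins.shape[0]
    total = wins + wins.T                 # comparisons made between each pair
    compared = total > 0
    np.fill_diagonal(compared, False)     # ignore self-comparisons

    # frac_lost[i, j]: fraction of the i-vs-j comparisons that j won.
    frac_lost = np.zeros_like(wins)
    frac_lost[compared] = wins.T[compared] / total[compared]

    # Scale by the maximum degree so that every row sums to at most 1,
    # then put the remaining mass on the diagonal (self-loop).
    d_max = compared.sum(axis=1).max()
    P = frac_lost / d_max
    np.fill_diagonal(P, 1.0 - P.sum(axis=1))

    # Power iteration for the stationary distribution pi = pi P.
    pi = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        pi = pi @ P
    return pi / pi.sum()

# Usage: expected win counts under a BTL model with weights 1, 2, 4.
w = np.array([1.0, 2.0, 4.0])
wins = 100 * w[:, None] / (w[:, None] + w[None, :])
np.fill_diagonal(wins, 0.0)
print(rank_centrality(wins))  # proportional to w when the win fractions are exact
```

When the win fractions equal the BTL probabilities exactly, the walk is reversible with respect to the normalized weights, so the stationary distribution recovers them; with finite noisy data the scores are only an estimate.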

arXiv:1208.1860 [pdf, other]

Scaling Multiple-Source Entity Resolution using Statistically Efficient Transfer Learning

Authors: Sahand Negahban, Benjamin I. P. Rubinstein, Jim Gemmell

Abstract: We consider a serious, previously-unexplored challenge facing almost all approaches to scaling up entity resolution (ER) to multiple data sources: the prohibitive cost of labeling training data for supervised learning of similarity scores for each pair of sources. While there exists a rich literature describing almost all aspects of pairwise ER, this new challenge is arising now due to the unprecedented ability to acquire and store data from online sources, features driven by ER such as enriched search verticals, and the uniqueness of noisy and missing data characteristics for each source. We show on real-world and synthetic data that for state-of-the-art techniques, the reality of heterogeneous sources means that the number of labeled training data must scale quadratically in the number of sources, just to maintain constant precision/recall. We address this challenge with a brand new transfer learning algorithm which requires far less training data (or equivalently, achieves superior accuracy with the same data) and is trained using fast convex optimization. The intuition behind our approach is to adaptively share structure learned about one scoring problem with all other scoring problems sharing a data source in common. We demonstrate that our theoretically motivated approach incurs no runtime cost while it can maintain constant precision/recall with the cost of labeling increasing only linearly with the number of sources.

Submitted 9 August, 2012; originally announced August 2012.

Comments: Short version to appear in CIKM'2012; 10 pages, 7 figures

ACM Class: H.2; I.2.6; I.5.4

arXiv:1207.4421 [pdf, ps, other]

Stochastic optimization and sparse statistical recovery: An optimal algorithm for high dimensions

Authors: Alekh Agarwal, Sahand Negahban, Martin J. Wainwright

Abstract: We develop and analyze stochastic optimization algorithms for problems in which the expected loss is strongly convex, and the optimum is (approximately) sparse. Previous approaches are able to exploit only one of these two structures, yielding an $\mathcal{O}(p/T)$ convergence rate for strongly convex objectives in $p$ dimensions, and an $\mathcal{O}(\sqrt{(s \log p)/T})$ convergence rate when the optimum is $s$-sparse. Our algorithm is based on successively solving a series of $\ell_1$-regularized optimization problems using Nesterov's dual averaging algorithm. We establish that the error of our solution after $T$ iterations is at most $\mathcal{O}((s \log p)/T)$, with natural extensions to approximate sparsity. Our results apply to locally Lipschitz losses including the logistic, exponential, hinge and least-squares losses. By recourse to statistical minimax results, we show that our convergence rates are optimal up to multiplicative constant factors. The effectiveness of our approach is also confirmed in numerical simulations, in which we compare to several baselines on a least-squares regression problem.

Submitted 18 July, 2012; originally announced July 2012.

Comments: 2 figures

arXiv:1104.4824 [pdf, ps, other]

Fast global convergence of gradient methods for high-dimensional statistical recovery

Authors: Alekh Agarwal, Sahand N. Negahban, Martin J. Wainwright

Abstract: Many statistical $M$-estimators are based on convex optimization problems formed by the combination of a data-dependent loss function with a norm-based regularizer. We analyze the convergence rates of projected gradient and composite gradient methods for solving such problems, working within a high-dimensional framework that allows the data dimension $\pdim$ to grow with (and possibly exceed) the sample size $\numobs$. This high-dimensional structure precludes the usual global assumptions (namely, strong convexity and smoothness conditions) that underlie much of classical optimization analysis. We define appropriately restricted versions of these conditions, and show that they are satisfied with high probability for various statistical models. Under these conditions, our theory guarantees that projected gradient descent has a globally geometric rate of convergence up to the statistical precision of the model, meaning the typical distance between the true unknown parameter $θ^*$ and an optimal solution $\hat{θ}$. This result is substantially sharper than previous convergence results, which yielded sublinear convergence, or linear convergence only up to the noise level. Our analysis applies to a wide range of $M$-estimators and statistical models, including sparse linear regression using Lasso ($\ell_1$-regularized regression); group Lasso for block sparsity; log-linear models with regularization; low-rank matrix recovery using nuclear norm regularization; and matrix decomposition. Overall, our analysis reveals interesting connections between statistical precision and computational efficiency in high-dimensional estimation.

Submitted 25 July, 2012; v1 submitted 25 April, 2011; originally announced April 2011.
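
The abstract above mentions projected gradient descent for sparse linear regression. As a rough illustration only, the sketch below runs projected gradient descent on a least-squares loss constrained to an $\ell_1$-ball. The function names, the sort-based projection routine, the step size, and the toy data are all assumptions made for this example; none of them are taken from the paper, and the code says nothing about the restricted strong convexity analysis itself.

```python
import math


def project_l1_ball(v, radius):
    """Euclidean projection of v onto {x : ||x||_1 <= radius}, sort-based.

    Assumes radius > 0. This is a standard projection technique, used here
    as an illustrative choice rather than the paper's own routine.
    """
    if sum(abs(x) for x in v) <= radius:
        return list(v)  # already feasible, nothing to do
    magnitudes = sorted((abs(x) for x in v), reverse=True)
    running_sum, tau = 0.0, 0.0
    for k, m in enumerate(magnitudes, start=1):
        running_sum += m
        candidate = (running_sum - radius) / k
        if m > candidate:  # the k largest entries stay nonzero
            tau = candidate
    # Soft-threshold every coordinate by tau, keeping its sign.
    return [math.copysign(max(abs(x) - tau, 0.0), x) for x in v]


def projected_gradient_descent(X, y, radius, step, iters):
    """Minimize (1/(2n)) * ||y - X theta||^2 subject to ||theta||_1 <= radius."""
    n, p = len(X), len(X[0])
    theta = [0.0] * p
    for _ in range(iters):
        residual = [sum(X[i][j] * theta[j] for j in range(p)) - y[i] for i in range(n)]
        grad = [sum(X[i][j] * residual[i] for i in range(n)) / n for j in range(p)]
        theta = project_l1_ball([theta[j] - step * grad[j] for j in range(p)], radius)
    return theta


# Toy problem (hypothetical data): with an identity design, the constrained
# minimizer is simply the projection of y onto the l1-ball.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 1.0]
theta_hat = projected_gradient_descent(X, y, radius=2.0, step=1.0, iters=60)
```

On this toy problem the iterates approach the projection of `y` onto the ball, which can be checked directly with `project_l1_ball(y, 2.0)`.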

arXiv:1102.4807 [pdf, ps, other]

doi 10.1214/12-AOS1000

Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions

Authors: Alekh Agarwal, Sahand N. Negahban, Martin J. Wainwright

Abstract: We analyze a class of estimators based on convex relaxation for solving high-dimensional matrix decomposition problems. The observations are noisy realizations of a linear transformation $\mathfrak{X}$ of the sum of an (approximately) low rank matrix $Θ^\star$ with a second matrix $Γ^\star$ endowed with a complementary form of low-dimensional structure; this set-up includes many statistical models of interest, including factor analysis, multi-task regression, and robust covariance estimation. We derive a general theorem that bounds the Frobenius norm error for an estimate of the pair $(Θ^\star, Γ^\star)$ obtained by solving a convex optimization problem that combines the nuclear norm with a general decomposable regularizer. Our results utilize a "spikiness" condition that is related to but milder than singular vector incoherence. We specialize our general result to two cases that have been studied in past work: low rank plus an entrywise sparse matrix, and low rank plus a columnwise sparse matrix. For both models, our theory yields non-asymptotic Frobenius error bounds for both deterministic and stochastic noise matrices, and applies to matrices $Θ^\star$ that can be exactly or approximately low rank, and matrices $Γ^\star$ that can be exactly or approximately sparse. Moreover, for the case of stochastic noise matrices and the identity observation operator, we establish matching lower bounds on the minimax error. The sharpness of our predictions is confirmed by numerical simulations.

Submitted 6 March, 2012; v1 submitted 23 February, 2011; originally announced February 2011.

Comments: 41 pages, 2 figures

Report number: IMS-AOS-AOS1000 MSC Class: 62F30; 62F30 (Primary) 62H12 (Secondary)

Journal ref: Annals of Statistics 2012, Vol. 40, No. 2, 1171-1197

arXiv:1010.2731 [pdf, ps, other]

doi 10.1214/12-STS400

A Unified Framework for High-Dimensional Analysis of M-Estimators with Decomposable Regularizers

Authors: Sahand N. Negahban, Pradeep Ravikumar, Martin J. Wainwright, Bin Yu

Abstract: High-dimensional statistical inference deals with models in which the number of parameters p is comparable to or larger than the sample size n. Since it is usually impossible to obtain consistent procedures unless $p/n\rightarrow0$, a line of recent work has studied models with various types of low-dimensional structure, including sparse vectors, sparse and structured matrices, low-rank matrices and combinations thereof. In such settings, a general approach to estimation is to solve a regularized optimization problem, which combines a loss function measuring how well the model fits the data with some regularization function that encourages the assumed structure. This paper provides a unified framework for establishing consistency and convergence rates for such regularized M-estimators under high-dimensional scaling. We state one main theorem and show how it can be used to re-derive some existing results, and also to obtain a number of new results on consistency and convergence rates, in both $\ell_2$-error and related norms. Our analysis also identifies two key properties of loss and regularization functions, referred to as restricted strong convexity and decomposability, that ensure corresponding regularized M-estimators have fast convergence rates and which are optimal in many well-studied cases.

Submitted 12 March, 2013; v1 submitted 13 October, 2010; originally announced October 2010.

Comments: Published at http://dx.doi.org/10.1214/12-STS400 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-STS-STS400

Journal ref: Statistical Science 2012, Vol. 27, No. 4, 538-557

arXiv:1009.2118 [pdf, ps, other]

Restricted strong convexity and weighted matrix completion: Optimal bounds with noise

Authors: Sahand Negahban, Martin J. Wainwright

Abstract: We consider the matrix completion problem under a form of row/column weighted entrywise sampling, including the case of uniform entrywise sampling as a special case. We analyze the associated random observation operator, and prove that with high probability, it satisfies a form of restricted strong convexity with respect to weighted Frobenius norm. Using this property, we obtain as corollaries a number of error bounds on matrix completion in the weighted Frobenius norm under noisy sampling and for both exact and near low-rank matrices. Our results are based on measures of the "spikiness" and "low-rankness" of matrices that are less restrictive than the incoherence conditions imposed in previous work. Our technique involves an $M$-estimator that includes controls on both the rank and spikiness of the solution, and we establish non-asymptotic error bounds in weighted Frobenius norm for recovering matrices lying within $\ell_q$-"balls" of bounded spikiness. Using information-theoretic methods, we show that no algorithm can achieve better estimates (up to a logarithmic factor) over these same sets, showing that our conditions on matrices and associated rates are essentially optimal.

Submitted 15 May, 2011; v1 submitted 10 September, 2010; originally announced September 2010.

arXiv:0912.5100 [pdf, ps, other]

Estimation of (near) low-rank matrices with noise and high-dimensional scaling

Authors: Sahand Negahban, Martin J. Wainwright

Abstract: High-dimensional inference refers to problems of statistical estimation in which the ambient dimension of the data may be comparable to or possibly even larger than the sample size. We study an instance of high-dimensional inference in which the goal is to estimate a matrix $Θ^* \in \real^{k \times p}$ on the basis of $N$ noisy observations, and the unknown matrix $Θ^*$ is assumed to be either exactly low rank, or "near" low-rank, meaning that it can be well-approximated by a matrix with low rank. We consider an $M$-estimator based on regularization by the trace or nuclear norm over matrices, and analyze its performance under high-dimensional scaling. We provide non-asymptotic bounds on the Frobenius norm error that hold for a general class of noisy observation models, and then illustrate their consequences for a number of specific matrix models, including low-rank multivariate or multi-task regression, system identification in vector autoregressive processes, and recovery of low-rank matrices from random projections. Simulation results show excellent agreement with the high-dimensional scaling of the error predicted by our theory.

Submitted 27 December, 2009; originally announced December 2009.

Comments: Appeared as Stat. technical report, UC Berkeley

arXiv:0905.0642 [pdf, ps, other]

Simultaneous support recovery in high dimensions: Benefits and perils of block $\ell_1/\ell_\infty$-regularization

Authors: S. Negahban, M. J. Wainwright

Abstract: Consider the use of $\ell_{1}/\ell_{\infty}$-regularized regression for joint estimation of a $\pdim \times \numreg$ matrix of regression coefficients. We analyze the high-dimensional scaling of $\ell_1/\ell_\infty$-regularized quadratic programming, considering both consistency in $\ell_\infty$-norm, and variable selection. We begin by establishing bounds on the $\ell_\infty$-error as well as sufficient conditions for exact variable selection for fixed and random designs. Our second set of results applies to $\numreg = 2$ linear regression problems with standard Gaussian designs whose supports overlap in a fraction $α\in [0,1]$ of their entries: for this problem class, we prove that the $\ell_{1}/\ell_{\infty}$-regularized method undergoes a phase transition (that is, a sharp change from failure to success) characterized by the rescaled sample size $θ_{1,\infty}(n, p, s, α) = n/\{(4 - 3 α) s \log(p-(2- α) s)\}$. An implication of this threshold is that use of $\ell_1 / \ell_{\infty}$-regularization yields improved statistical efficiency if the overlap parameter is large enough ($α> 2/3$), but has worse statistical efficiency than a naive Lasso-based approach for moderate to small overlap ($α< 2/3$). These results indicate that some caution needs to be exercised in the application of $\ell_1/\ell_\infty$ block regularization: if the data does not match its structure closely enough, it can impair statistical performance relative to computationally less expensive schemes.

Submitted 5 May, 2009; originally announced May 2009.

Comments: Presented in part at NIPS 2008 conference, Vancouver, Canada, December 2008
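
The rescaled sample size $θ_{1,\infty}(n, p, s, α) = n/\{(4 - 3 α) s \log(p-(2- α) s)\}$ quoted in the abstract above can be evaluated directly. The sketch below is only an illustration of that formula: the function names are hypothetical, and `min_samples` assumes the transition sits at a critical value of 1, which the abstract itself does not state.

```python
import math


def rescaled_sample_size(n, p, s, alpha):
    """Evaluate n / ((4 - 3*alpha) * s * log(p - (2 - alpha) * s))."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("overlap fraction alpha must lie in [0, 1]")
    return n / ((4.0 - 3.0 * alpha) * s * math.log(p - (2.0 - alpha) * s))


def min_samples(p, s, alpha):
    """Sample size at which the rescaled sample size equals 1.

    Assumption: a critical value of 1 is used purely for illustration.
    """
    return (4.0 - 3.0 * alpha) * s * math.log(p - (2.0 - alpha) * s)


# Hypothetical problem size: p = 1024 covariates, sparsity s = 16.
no_overlap = min_samples(1024, 16, alpha=0.0)
full_overlap = min_samples(1024, 16, alpha=1.0)
```

The leading coefficient $(4 - 3α)$ falls from 4 at $α = 0$ to 1 at $α = 1$ and equals 2 at $α = 2/3$, the overlap value that the abstract identifies as the crossover with the Lasso-based approach.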

Showing 1–23 of 23 results for author: Negahban, S