
Corruption-tolerant Algorithms for Generalized Linear Models

Authors: Bhaskar P Mukhoty, Debojyoti Dey, Purushottam Kar

Abstract: This paper presents SVAM (Sequential Variance-Altered MLE), a unified framework for learning generalized linear models under adversarial label corruption in training data. SVAM extends to tasks such as least squares regression, logistic regression, and gamma regression, whereas many existing works on learning with label corruptions focus only on least squares regression. SVAM is based on a novel variance reduction technique that may be of independent interest and works by iteratively solving weighted MLEs over variance-altered versions of the GLM objective. SVAM offers provable model recovery guarantees superior to the state-of-the-art for robust regression even when a constant fraction of training labels are adversarially corrupted. SVAM also empirically outperforms several existing problem-specific techniques for robust regression and classification. Code for SVAM is available at https://github.com/purushottamkar/svam/

Submitted 11 December, 2022; originally announced December 2022.

Comments: 46 pages, 5 figures, to appear in the 37th AAAI Conference on Artificial Intelligence (AAAI), 2023

arXiv:2111.03932 [pdf, other]

AGGLIO: Global Optimization for Locally Convex Functions

Authors: Debojyoti Dey, Bhaskar Mukhoty, Purushottam Kar

Abstract: This paper presents AGGLIO (Accelerated Graduated Generalized LInear-model Optimization), a stage-wise, graduated optimization technique that offers global convergence guarantees for non-convex optimization problems whose objectives offer only local convexity and may fail to be even quasi-convex at a global scale. In particular, this includes learning problems that utilize popular activation functions such as sigmoid, softplus and SiLU that yield non-convex training objectives. AGGLIO can be readily implemented using point as well as mini-batch SGD updates and offers provable convergence to the global optimum in general conditions. In experiments, AGGLIO outperformed several recently proposed optimization techniques for non-convex and locally convex objectives in terms of convergence rate as well as convergent accuracy. AGGLIO relies on a graduation technique for generalized linear models, as well as a novel proof strategy, both of which may be of independent interest.

Submitted 6 November, 2021; originally announced November 2021.

Comments: 33 pages, 7 figures, to appear at 9th ACM IKDD Conference on Data Science (CODS) 2022. Code for AGGLIO is available at https://github.com/purushottamkar/agglio/

arXiv:2105.10190 [pdf, other]

AngularGrad: A New Optimization Technique for Angular Convergence of Convolutional Neural Networks

Authors: S. K. Roy, M. E. Paoletti, J. M. Haut, S. R. Dubey, P. Kar, A. Plaza, B. B. Chaudhuri

Abstract: Convolutional neural networks (CNNs) are trained using stochastic gradient descent (SGD)-based optimizers. Recently, the adaptive moment estimation (Adam) optimizer has become very popular due to its adaptive momentum, which tackles the dying gradient problem of SGD. Nevertheless, existing optimizers are still unable to exploit the optimization curvature information efficiently. This paper proposes a new AngularGrad optimizer that considers the behavior of the direction/angle of consecutive gradients. This is the first attempt in the literature to exploit the gradient angular information apart from its magnitude. The proposed AngularGrad generates a score to control the step size based on the gradient angular information of previous iterations. Thus, the optimization steps become smoother as a more accurate step size of immediate past gradients is captured through the angular information. Two variants of AngularGrad are developed based on the use of Tangent or Cosine functions for computing the gradient angular information. Theoretically, AngularGrad exhibits the same regret bound as Adam for convergence purposes. Nevertheless, extensive experiments conducted on benchmark data sets against state-of-the-art methods reveal a superior performance of AngularGrad. The source code will be made publicly available at: https://github.com/mhaut/AngularGrad.

Submitted 9 September, 2023; v1 submitted 21 May, 2021; originally announced May 2021.

arXiv:2006.14211 [pdf, other]

Globally-convergent Iteratively Reweighted Least Squares for Robust Regression Problems

Authors: Bhaskar Mukhoty, Govind Gopakumar, Prateek Jain, Purushottam Kar

Abstract: We provide the first global model recovery results for the IRLS (iteratively reweighted least squares) heuristic for robust regression problems. IRLS is known to offer excellent performance, despite bad initializations and data corruption, for several parameter estimation problems. Existing analyses of IRLS frequently require careful initialization, thus offering only local convergence guarantees. We remedy this by proposing augmentations to the basic IRLS routine that not only offer guaranteed global recovery, but in practice also outperform state-of-the-art algorithms for robust regression. Our routines are more immune to hyperparameter misspecification in basic regression tasks, as well as applied tasks such as linear-armed bandit problems. Our theoretical analyses rely on a novel extension of the notions of strong convexity and smoothness to weighted strong convexity and smoothness, and establishing that sub-Gaussian designs offer bounded weighted condition numbers. These notions may be useful in analyzing other algorithms as well.

Submitted 25 June, 2020; originally announced June 2020.

Comments: 30 pages, 5 figures, appeared as a publication in the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), 2019

Journal ref: Proceedings of Machine Learning Research (PMLR) 89:313-322, 2019
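
As a rough illustration of the basic IRLS heuristic mentioned in the abstract above, the following is a minimal sketch of textbook IRLS for least-absolute-deviation regression on a scalar model y = w * x. It is not the augmented routine proposed in the paper; the function name `irls_scalar`, the inverse-residual weights, the `eps` floor, and the toy data are all assumptions made for illustration.

```python
def irls_scalar(xs, ys, iters=50, eps=1e-8):
    """Textbook IRLS for min_w sum_i |y_i - w * x_i| (illustrative only)."""
    w = 0.0
    weights = [1.0] * len(xs)  # first pass is ordinary least squares
    for _ in range(iters):
        # Weighted least-squares step for the scalar model y ~ w * x.
        num = sum(s * x * y for s, x, y in zip(weights, xs, ys))
        den = sum(s * x * x for s, x in zip(weights, xs))
        w = num / den
        # Reweight: points with large residuals get small weights.
        # eps (an assumed floor) avoids division by zero.
        weights = [1.0 / max(abs(y - w * x), eps) for x, y in zip(xs, ys)]
    return w


# Hypothetical toy data: true slope 2.0, two corrupted responses.
irls_xs = [float(i) for i in range(1, 11)]
irls_ys = [2.0 * x for x in irls_xs]
irls_ys[2] += 40.0
irls_ys[7] -= 55.0
w_irls = irls_scalar(irls_xs, irls_ys)
```

On this toy data the plain least-squares fit from the first pass is pulled away from the true slope by the two corrupted points, while the reweighted iterations move the estimate back towards 2.0.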

arXiv:2005.14015 [pdf, other]

MACER: A Modular Framework for Accelerated Compilation Error Repair

Authors: Darshak Chhatbar, Umair Z. Ahmed, Purushottam Kar

Abstract: Automated compilation error repair, the problem of suggesting fixes to buggy programs that fail to compile, has generated significant interest in recent years. Apart from being a tool of general convenience, automated code repair has significant pedagogical applications for novice programmers who find compiler error messages cryptic and unhelpful. Existing approaches largely solve this problem using a black-box application of a heavy-duty generative learning technique, such as sequence-to-sequence prediction (TRACER) or reinforcement learning (RLAssist). Although convenient, such black-box application of learning techniques makes existing approaches bulky in terms of training time, as well as inefficient at targeting specific error types. We present MACER, a novel technique for accelerated error repair based on a modular segregation of the repair process into repair identification and repair application. MACER uses powerful yet inexpensive discriminative learning techniques such as multi-label classifiers and rankers to first identify the type of repair required and then apply the suggested repair. Experiments indicate that the fine-grained approach adopted by MACER offers not only superior error correction, but also much faster training and prediction. On a benchmark dataset of 4K buggy programs collected from actual student submissions, MACER outperforms existing methods by 20% at suggesting fixes for popular errors that exactly match the fix desired by the student.
MACER is also competitive with or better than existing methods at all error types -- whether popular or rare. MACER offers a training time speedup of 2x over TRACER and 800x over RLAssist, and a test time speedup of 2-4x over both.

Submitted 28 May, 2020; originally announced May 2020.

Comments: 19 pages, 9 figures. A short version of this paper will appear at the 21st International Conference on Artificial Intelligence in Education (AIED). Code for the MACER tool-chain is available at https://github.com/purushottamkar/macer/

arXiv:2005.11257 [pdf, other]

Epidemiologically and Socio-economically Optimal Policies via Bayesian Optimization

Authors: Amit Chandak, Debojyoti Dey, Bhaskar Mukhoty, Purushottam Kar

Abstract: Mass public quarantining, colloquially known as a lock-down, is a non-pharmaceutical intervention to check the spread of disease. This paper presents ESOP (Epidemiologically and Socio-economically Optimal Policies), a novel application of active machine learning techniques using Bayesian optimization, that interacts with an epidemiological model to arrive at lock-down schedules that optimally balance public health benefits and socio-economic downsides of reduced economic activity during lock-down periods. The utility of ESOP is demonstrated using case studies with VIPER (Virus-Individual-Policy-EnviRonment), a stochastic agent-based simulator that this paper also proposes. However, ESOP is flexible enough to interact with arbitrary epidemiological simulators in a black-box manner, and produce schedules that involve multiple phases of lock-downs.

Submitted 14 June, 2020; v1 submitted 22 May, 2020; originally announced May 2020.

Comments: Keywords: COVID-19, Optimal Policy, Lock-down, Epidemiology, Bayesian Optimization. Code available at https://github.com/purushottamkar/esop

MSC Class: 92D30 (Primary) 90C26; 90C56; 60G15 (Secondary)

arXiv:1905.11769 [pdf, other]

Accelerating Extreme Classification via Adaptive Feature Agglomeration

Authors: Ankit Jalan, Purushottam Kar

Abstract: Extreme classification seeks to assign each data point the most relevant labels from a universe of a million or more labels. This task is faced with the dual challenge of high precision and scalability, with millisecond level prediction times being a benchmark. We propose DEFRAG, an adaptive feature agglomeration technique to accelerate extreme classification algorithms. Despite past works on feature clustering and selection, DEFRAG distinguishes itself in being able to scale to millions of features, and is especially beneficial when feature sets are sparse, which is typical of recommendation and multi-label datasets. The method comes with provable performance guarantees and performs efficient task-driven agglomeration to reduce feature dimensionalities by an order of magnitude or more. Experiments show that DEFRAG can not only reduce training and prediction times of several leading extreme classification algorithms by as much as 40%, but also be used for feature reconstruction to address the problem of missing features, as well as offer superior coverage on rare labels.

Submitted 28 May, 2019; originally announced May 2019.

Comments: A version of this paper without the appendices will appear at the 28th International Joint Conference on Artificial Intelligence (IJCAI 2019). Code for this paper is available at https://github.com/purushottamkar/defrag/

arXiv:1902.00491 [pdf, other]

doi 10.1016/j.neunet.2020.07.026

DANTE: Deep AlterNations for Training nEural networks

Authors: Vaibhav B Sinha, Sneha Kudugunta, Adepu Ravi Sankar, Surya Teja Chavali, Purushottam Kar, Vineeth N Balasubramanian

Abstract: We present DANTE, a novel method for training neural networks using the alternating minimization principle. DANTE provides an alternate perspective to traditional gradient-based backpropagation techniques commonly used to train deep networks. It utilizes an adaptation of quasi-convexity to cast training a neural network as a bi-quasi-convex optimization problem. We show that for neural network configurations with both differentiable (e.g. sigmoid) and non-differentiable (e.g. ReLU) activation functions, we can perform the alternations effectively in this formulation. DANTE can also be extended to networks with multiple hidden layers. In experiments on standard datasets, neural networks trained using the proposed method were found to be promising and competitive with traditional backpropagation techniques, both in terms of quality of the solution, as well as training speed.

Submitted 9 August, 2020; v1 submitted 1 February, 2019; originally announced February 2019.

Comments: 19 pages

Journal ref: Neural Networks 131 (2020) 127-143

arXiv:1802.00086 [pdf, other]

doi 10.1007/s10994-018-5736-y

Optimizing Non-decomposable Measures with Deep Networks

Authors: Amartya Sanyal, Pawan Kumar, Purushottam Kar, Sanjay Chawla, Fabrizio Sebastiani

Abstract: We present a class of algorithms capable of directly training deep neural networks with respect to large families of task-specific performance measures such as the F-measure and the Kullback-Leibler divergence that are structured and non-decomposable. This presents a departure from standard deep learning techniques that typically use squared or cross-entropy loss functions (that are decomposable) to train neural networks. We demonstrate that directly training with task-specific loss functions yields much faster and more stable convergence across problems and datasets. Our proposed algorithms and implementations have several novel features including (i) convergence to first order stationary points despite optimizing complex objective functions; (ii) use of fewer training samples to achieve a desired level of convergence, (iii) a substantial reduction in training time, and (iv) a seamless integration of our implementation into existing symbolic gradient frameworks. We implement our techniques on a variety of deep architectures including multi-layer perceptrons and recurrent neural networks and show that on a variety of benchmark and real data sets, our algorithms outperform traditional approaches to training deep networks, as well as some recent approaches to task-specific training of neural networks.

Submitted 31 January, 2018; originally announced February 2018.

Journal ref: Final version published in Machine Learning, 107(8-10):1597-1620, 2018

arXiv:1712.07897 [pdf, other]

doi 10.1561/2200000058

Non-convex Optimization for Machine Learning

Authors: Prateek Jain, Purushottam Kar

Abstract: A vast majority of machine learning algorithms train their models and perform inference by solving optimization problems. In order to capture the learning and prediction problems accurately, structural constraints such as sparsity or low rank are frequently imposed or else the objective itself is designed to be a non-convex function. This is especially true of algorithms that operate in high-dimensional spaces or that train non-linear models such as tensor models and deep networks. The freedom to express the learning problem as a non-convex optimization problem gives immense modeling power to the algorithm designer, but often such problems are NP-hard to solve. A popular workaround to this has been to relax non-convex problems to convex ones and use traditional methods to solve the (convex) relaxed optimization problems. However, this approach may be lossy and nevertheless presents significant challenges for large scale optimization. On the other hand, direct approaches to non-convex optimization have met with resounding success in several domains and remain the methods of choice for the practitioner, as they frequently outperform relaxation-based techniques - popular heuristics include projected gradient descent and alternating minimization. However, these are often poorly understood in terms of their convergence and other properties. This monograph presents a selection of recent advances that bridge a long-standing gap in our understanding of these heuristics.
The monograph will lead the reader through several widely used non-convex optimization techniques, as well as applications thereof. The goal of this monograph is both to introduce the rich literature in this area and to equip the reader with the tools and techniques needed to analyze these simple procedures for non-convex problems.

Submitted 21 December, 2017; originally announced December 2017.

Comments: The official publication is available from now publishers via http://dx.doi.org/10.1561/2200000058

Journal ref: Foundations and Trends in Machine Learning: Vol. 10: No. 3-4, pp 142-336 (2017)

arXiv:1608.03544 [pdf, other]

On Context-Dependent Clustering of Bandits

Authors: Claudio Gentile, Shuai Li, Purushottam Kar, Alexandros Karatzoglou, Evans Etrue, Giovanni Zappella

Abstract: We investigate a novel cluster-of-bandit algorithm CAB for collaborative recommendation tasks that implements the underlying feedback sharing mechanism by estimating the neighborhood of users in a context-dependent manner. CAB makes sharp departures from the state of the art by incorporating collaborative effects into inference as well as learning processes in a manner that seamlessly interleaves explore-exploit tradeoffs and collaborative steps. We prove regret bounds under various assumptions on the data, which exhibit a crisp dependence on the expected number of clusters over the users, a natural measure of the statistical difficulty of the learning task. Experiments on production and real-world datasets show that CAB offers significantly increased prediction performance against a representative pool of state-of-the-art methods.

Submitted 27 February, 2017; v1 submitted 6 August, 2016; originally announced August 2016.

arXiv:1607.00146 [pdf, ps, other]

Efficient and Consistent Robust Time Series Analysis

Authors: Kush Bhatia, Prateek Jain, Parameswaran Kamalaruban, Purushottam Kar

Abstract: We study the problem of robust time series analysis under the standard auto-regressive (AR) time series model in the presence of arbitrary outliers. We devise an efficient hard thresholding based algorithm which can obtain a consistent estimate of the optimal AR model despite a large fraction of the time series points being corrupted. Our algorithm alternately estimates the corrupted set of points and the model parameters, and is inspired by recent advances in robust regression and hard-thresholding methods. However, a direct application of existing techniques is hindered by a critical difference in the time-series domain: each point is correlated with all previous points rendering existing tools inapplicable directly. We show how to overcome this hurdle using novel proof techniques. Using our techniques, we are also able to provide the first efficient and provably consistent estimator for the robust regression problem where a standard linear observation model with white additive noise is corrupted arbitrarily. We illustrate our methods on synthetic datasets and show that our methods indeed are able to consistently recover the optimal parameters despite a large fraction of points being corrupted.

Submitted 1 July, 2016; originally announced July 2016.

arXiv:1605.04135 [pdf, other]

doi 10.1145/2939672.2939832

Online Optimization Methods for the Quantification Problem

Authors: Purushottam Kar, Shuai Li, Harikrishna Narasimhan, Sanjay Chawla, Fabrizio Sebastiani

Abstract: The estimation of class prevalence, i.e., the fraction of a population that belongs to a certain class, is a very useful tool in data analytics and learning, and finds applications in many domains such as sentiment analysis, epidemiology, etc. For example, in sentiment analysis, the objective is often not to estimate whether a specific text conveys a positive or a negative sentiment, but rather to estimate the overall distribution of positive and negative sentiments during an event window. A popular way of performing the above task, often dubbed quantification, is to use supervised learning to train a prevalence estimator from labeled data. Contemporary literature cites several performance measures used to measure the success of such prevalence estimators. In this paper we propose the first online stochastic algorithms for directly optimizing these quantification-specific performance measures. We also provide algorithms that optimize hybrid performance measures that seek to balance quantification and classification performance. Our algorithms present a significant advancement in the theory of multivariate optimization and we show, by a rigorous theoretical analysis, that they exhibit optimal convergence. We also report extensive experiments on benchmark and real data sets which demonstrate that our methods significantly outperform existing optimization techniques used for these performance measures.

Submitted 13 June, 2016; v1 submitted 13 May, 2016; originally announced May 2016.

Comments: 26 pages, 6 figures. A short version of this manuscript will appear in the proceedings of the 22nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2016

Journal ref: Final version published in Proceedings of the 22nd ACM Conference on Knowledge Discovery and Data Mining (KDD 2016), San Francisco, US, 2016, pp. 1625-1634

arXiv:1510.03164

Context-Aware Bandits

Authors: Shuai Li, Purushottam Kar

Abstract: We propose an efficient Context-Aware clustering of Bandits (CAB) algorithm, which can capture collaborative effects. CAB can be easily deployed in a real-world recommendation system, where multi-armed bandits have been shown to perform well in particular with respect to the cold-start problem. CAB utilizes a context-aware clustering augmented by exploration-exploitation strategies. CAB dynamically clusters the users based on the content universe under consideration. We give a theoretical analysis in the standard stochastic multi-armed bandits setting. We show the efficiency of our approach on production and real-world datasets, demonstrate the scalability, and, more importantly, the significantly increased prediction performance against several state-of-the-art methods.

Submitted 26 February, 2017; v1 submitted 12 October, 2015; originally announced October 2015.

Comments: The paper has been withdrawn as the work has been superseded

arXiv:1507.02743 [pdf, ps, other]

Locally Non-linear Embeddings for Extreme Multi-label Learning

Authors: Kush Bhatia, Himanshu Jain, Purushottam Kar, Prateek Jain, Manik Varma

Abstract: The objective in extreme multi-label learning is to train a classifier that can automatically tag a novel data point with the most relevant subset of labels from an extremely large label set. Embedding based approaches make training and prediction tractable by assuming that the training label matrix is low-rank and hence the effective number of labels can be reduced by projecting the high dimensional label vectors onto a low dimensional linear subspace. Still, leading embedding approaches have been unable to deliver high prediction accuracies or scale to large problems as the low rank assumption is violated in most real world applications. This paper develops the X-One classifier to address both limitations. The main technical contribution in X-One is a formulation for learning a small ensemble of local distance preserving embeddings which can accurately predict infrequently occurring (tail) labels. This allows X-One to break free of the traditional low-rank assumption and boost classification accuracy by learning embeddings which preserve pairwise distances between only the nearest label vectors. We conducted extensive experiments on several real-world as well as benchmark data sets and compared our method against state-of-the-art methods for extreme multi-label classification. Experiments reveal that X-One can make significantly more accurate predictions than the state-of-the-art methods including both embeddings (by as much as 35%) as well as trees (by as much as 6%).
X-One can also scale efficiently to data sets with a million labels which are beyond the pale of leading embedding methods.

Submitted 9 July, 2015; originally announced July 2015.

arXiv:1506.02428 [pdf, other]

Robust Regression via Hard Thresholding

Authors: Kush Bhatia, Prateek Jain, Purushottam Kar

Abstract: We study the problem of Robust Least Squares Regression (RLSR) where several response variables can be adversarially corrupted. More specifically, for a data matrix X \in R^{p x n} and an underlying model w*, the response vector is generated as y = X'w* + b where b \in R^n is the corruption vector supported over at most C.n coordinates. Existing exact recovery results for RLSR focus solely on L1-penalty based convex formulations and impose relatively strict model assumptions such as requiring the corruptions b to be selected independently of X. In this work, we study a simple hard-thresholding algorithm called TORRENT which, under mild conditions on X, can recover w* exactly even if b corrupts the response variables in an adversarial manner, i.e. both the support and entries of b are selected adversarially after observing X and w*. Our results hold under deterministic assumptions which are satisfied if X is sampled from any sub-Gaussian distribution. Finally, unlike existing results that apply only to a fixed w*, generated independently of X, our results are universal and hold for any w* \in R^p. Next, we propose gradient descent-based extensions of TORRENT that can scale efficiently to large scale problems, such as high dimensional sparse recovery, and prove similar recovery guarantees for these extensions. Empirically we find TORRENT, and more so its extensions, offering significantly faster recovery than the state-of-the-art L1 solvers.
For instance, even on moderate-sized datasets (with p = 50K) with around 40% corrupted responses, a variant of our proposed method called TORRENT-HYB is more than 20x faster than the best L1 solver.

Submitted 8 June, 2015; originally announced June 2015.

Comments: 24 pages, 3 figures
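
The hard-thresholding idea in the abstract above can be illustrated with a minimal sketch. It assumes a scalar model (p = 1, so y_i = x_i * w + b_i) and that the number of corrupted responses is known in advance. The names `fit_scalar`, `hard_threshold` and `robust_fit` are hypothetical and chosen for illustration only. This is a simplified alternation between fitting and thresholding, not the TORRENT algorithm as specified in the paper.

```python
def fit_scalar(xs, ys, active):
    """Least-squares slope through the origin, using only the active points."""
    num = sum(xs[i] * ys[i] for i in active)
    den = sum(xs[i] * xs[i] for i in active)
    return num / den


def hard_threshold(residuals, keep):
    """Return the indices of the `keep` smallest absolute residuals."""
    order = sorted(range(len(residuals)), key=lambda i: abs(residuals[i]))
    return sorted(order[:keep])


def robust_fit(xs, ys, num_corrupted, iterations=20):
    """Alternate between fitting on the active set and re-selecting it."""
    n = len(xs)
    keep = n - num_corrupted           # assumed known; an illustration-only shortcut
    active = list(range(n))            # start by trusting every point
    w = 0.0
    for _ in range(iterations):
        w = fit_scalar(xs, ys, active)
        residuals = [ys[i] - xs[i] * w for i in range(n)]
        new_active = hard_threshold(residuals, keep)
        if new_active == active:       # active set is stable, stop early
            break
        active = new_active
    return w, active


# Toy data: true slope 2.0, two responses corrupted by large offsets.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
ys = [2.0 * x for x in xs]
ys[2] += 15.0
ys[7] -= 20.0
w_hat, active = robust_fit(xs, ys, num_corrupted=2)
```

On this toy data the first fit is pulled away from the true slope, but the two corrupted points still have the largest residuals, so they are dropped and the second fit uses clean points only.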

arXiv:1505.06813 [pdf, other]

Surrogate Functions for Maximizing Precision at the Top

Authors: Purushottam Kar, Harikrishna Narasimhan, Prateek Jain

Abstract: The problem of maximizing precision at the top of a ranked list, often dubbed Precision@k (prec@k), finds relevance in myriad learning applications such as ranking, multi-label classification, and learning with severe label imbalance. However, despite its popularity, there exist significant gaps in our understanding of this problem and its associated performance measure. The most notable of these is the lack of a convex upper bounding surrogate for prec@k. We also lack scalable perceptron and stochastic gradient descent algorithms for optimizing this performance measure. In this paper we make key contributions in these directions. At the heart of our results is a family of truly upper bounding surrogates for prec@k. These surrogates are motivated in a principled manner and enjoy attractive properties such as consistency to prec@k under various natural margin/noise conditions. These surrogates are then used to design a class of novel perceptron algorithms for optimizing prec@k with provable mistake bounds. We also devise scalable stochastic gradient descent style methods for this problem with provable convergence bounds. Our proofs rely on novel uniform convergence bounds which require an in-depth analysis of the structural properties of prec@k and its surrogates. We conclude with experimental results comparing our algorithms with state-of-the-art cutting plane and stochastic gradient algorithms for maximizing prec@k.

Submitted 26 May, 2015; originally announced May 2015.

Comments: To appear in the proceedings of the 32nd International Conference on Machine Learning (ICML 2015)

Journal ref: Journal of Machine Learning Research, W&CP 37 (2015)
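
For readers unfamiliar with the measure named in the abstract above, the following is a minimal sketch of one common definition of prec@k: the fraction of the k highest-scored items that are labelled positive. The function name `precision_at_k` and the 0/1 label encoding are assumptions made for illustration; the paper's surrogates and algorithms are not reproduced here.

```python
def precision_at_k(scores, labels, k):
    """Fraction of the k highest-scored items whose label is positive (1)."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of items")
    # Rank item indices by decreasing score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in ranked[:k]) / k


scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 0, 1, 1, 0]
p_at_3 = precision_at_k(scores, labels, 3)  # two of the top three items are positive
```

Because the measure depends on the ranking of all items at once, it does not decompose into a sum of per-item losses, which is the difficulty the abstract refers to.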

arXiv:1505.06812 [pdf, other]

Optimizing Non-decomposable Performance Measures: A Tale of Two Classes

Authors: Harikrishna Narasimhan, Purushottam Kar, Prateek Jain

Abstract: Modern classification problems frequently present mild to severe label imbalance as well as specific requirements on classification characteristics, and require optimizing performance measures that are non-decomposable over the dataset, such as F-measure. Such measures have spurred much interest and pose specific challenges to learning algorithms since their non-additive nature precludes a direct application of well-studied large scale optimization methods such as stochastic gradient descent. In this paper we reveal that for two large families of performance measures that can be expressed as functions of true positive/negative rates, it is indeed possible to implement point stochastic updates. The families we consider are concave and pseudo-linear functions of TPR, TNR which cover several popularly used performance measures such as F-measure, G-mean and H-mean. Our core contribution is an adaptive linearization scheme for these families, using which we develop optimization techniques that enable truly point-based stochastic updates. For concave performance measures we propose SPADE, a stochastic primal dual solver; for pseudo-linear measures we propose STAMP, a stochastic alternate maximization procedure. Both methods have crisp convergence guarantees, demonstrate significant speedups over existing methods - often by an order of magnitude or more, and give similar or more accurate predictions on test data.

Submitted 26 May, 2015; originally announced May 2015.

Comments: To appear in proceedings of the 32nd International Conference on Machine Learning (ICML 2015)

Journal ref: Journal of Machine Learning Research, W&CP 37 (2015)

arXiv:1410.6776 [pdf, other]

Online and Stochastic Gradient Methods for Non-decomposable Loss Functions

Authors: Purushottam Kar, Harikrishna Narasimhan, Prateek Jain

Abstract: Modern applications in sensitive domains such as biometrics and medicine frequently require the use of non-decomposable loss functions such as precision@k, F-measure etc. Compared to point loss functions such as hinge-loss, these offer much more fine grained control over prediction, but at the same time present novel challenges in terms of algorithm design and analysis. In this work we initiate a study of online learning techniques for such non-decomposable loss functions with an aim to enable incremental learning as well as design scalable solvers for batch problems. To this end, we propose an online learning framework for such loss functions. Our model enjoys several nice properties, chief amongst them being the existence of efficient online learning algorithms with sublinear regret and online to batch conversion bounds. Our model is a provable extension of existing online learning models for point loss functions. We instantiate two popular losses, prec@k and pAUC, in our model and prove sublinear regret bounds for both of them. Our proofs require a novel structural lemma over ranked lists which may be of independent interest. We then develop scalable stochastic gradient descent solvers for non-decomposable loss functions. We show that for a large family of loss functions satisfying a certain uniform convergence property (that includes prec@k, pAUC, and F-measure), our methods provably converge to the empirical risk minimizer. Such uniform convergence results were not known for these losses and we establish these using novel proof techniques. We then use extensive experimentation on real life and benchmark datasets to establish that our method can be orders of magnitude faster than a recently proposed cutting plane method.

Submitted 24 October, 2014; originally announced October 2014.

Comments: 25 pages, 3 figures, To appear in the proceedings of the 28th Annual Conference on Neural Information Processing Systems, NIPS 2014

arXiv:1410.5137 [pdf, other]

On Iterative Hard Thresholding Methods for High-dimensional M-Estimation

Authors: Prateek Jain, Ambuj Tewari, Purushottam Kar

Abstract: The use of M-estimators in generalized linear regression models in high dimensional settings requires risk minimization with hard $L_0$ constraints. Of the known methods, the class of projected gradient descent (also known as iterative hard thresholding (IHT)) methods is known to offer the fastest and most scalable solutions. However, the current state-of-the-art is only able to analyze these methods in extremely restrictive settings which do not hold in high dimensional statistical models. In this work we bridge this gap by providing the first analysis for IHT-style methods in the high dimensional statistical setting. Our bounds are tight and match known minimax lower bounds. Our results rely on a general analysis framework that enables us to analyze several popular hard thresholding style algorithms (such as HTP, CoSaMP, SP) in the high dimensional regression setting. We also extend our analysis to a large family of "fully corrective methods" that includes two-stage and partial hard-thresholding algorithms. We show that our results hold for the problem of sparse regression, as well as low-rank matrix recovery.

Submitted 21 October, 2014; v1 submitted 19 October, 2014; originally announced October 2014.

Comments: 20 pages, 3 figures, To appear in the proceedings of the 28th Annual Conference on Neural Information Processing Systems, NIPS 2014
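
The projected gradient view of IHT mentioned in the abstract above can be sketched as a single update: take a gradient step on the squared loss, then project onto the set of s-sparse vectors by keeping the s largest-magnitude coordinates. This is a minimal illustration under assumed choices (squared loss, plain Python lists, a fixed step size); the names `hard_threshold` and `iht_step` are hypothetical, and the variants analyzed in the paper (HTP, CoSaMP, SP, fully corrective methods) are not shown.

```python
def hard_threshold(w, s):
    """Project w onto s-sparse vectors: keep the s largest-magnitude entries."""
    by_magnitude = sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)
    keep = set(by_magnitude[:s])
    return [w[j] if j in keep else 0.0 for j in range(len(w))]


def iht_step(X, y, w, s, step):
    """One IHT update for the loss 0.5 * ||y - X w||^2 with an s-sparsity constraint."""
    n, p = len(X), len(w)
    residual = [y[i] - sum(X[i][j] * w[j] for j in range(p)) for i in range(n)]
    # Gradient of the squared loss with respect to w is -X^T (y - X w).
    grad = [-sum(X[i][j] * residual[i] for i in range(n)) for j in range(p)]
    moved = [w[j] - step * grad[j] for j in range(p)]
    return hard_threshold(moved, s)


# Toy check with an identity design: one step from zero keeps the two largest entries of y.
X = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]
y = [3.0, -0.1, 0.2, -4.0]
w_next = iht_step(X, y, [0.0, 0.0, 0.0, 0.0], s=2, step=1.0)
```

In practice the step would be repeated until the iterate stops changing, and the step size would need to be chosen to suit the design matrix.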

arXiv:1305.2505 [pdf, other]

On the Generalization Ability of Online Learning Algorithms for Pairwise Loss Functions

Authors: Purushottam Kar, Bharath K Sriperumbudur, Prateek Jain, Harish C Karnick

Abstract: In this paper, we study the generalization properties of online learning based stochastic methods for supervised learning problems where the loss function is dependent on more than one training sample (e.g., metric learning, ranking). We present a generic decoupling technique that enables us to provide Rademacher complexity-based generalization error bounds. Our bounds are in general tighter than those obtained by Wang et al (COLT 2012) for the same problem. Using our decoupling technique, we are further able to obtain fast convergence rates for strongly convex pairwise loss functions. We are also able to analyze a class of memory efficient online learning algorithms for pairwise learning problems that use only a bounded subset of past training samples to update the hypothesis at each step. Finally, in order to complement our generalization bounds, we propose a novel memory efficient online learning algorithm for higher order learning problems with bounded regret guarantees.

Submitted 11 May, 2013; originally announced May 2013.

Comments: To appear in proceedings of the 30th International Conference on Machine Learning (ICML 2013)

Journal ref: Journal of Machine Learning Research, W&CP 28(3) (2013)

arXiv:1302.4343 [pdf, ps, other]

On Translation Invariant Kernels and Screw Functions

Authors: Purushottam Kar, Harish Karnick

Abstract: We explore the connection between Hilbertian metrics and positive definite kernels on the real line. In particular, we look at a well-known characterization of translation invariant Hilbertian metrics on the real line by von Neumann and Schoenberg (1941). Using this result we are able to give an alternate proof of Bochner's theorem for translation invariant positive definite kernels on the real line (Rudin, 1962).

Submitted 18 February, 2013; originally announced February 2013.

arXiv:1302.0406 [pdf, ps, other]

Generalization Guarantees for a Binary Classification Framework for Two-Stage Multiple Kernel Learning

Authors: Purushottam Kar

Abstract: We present generalization bounds for the TS-MKL framework for two stage multiple kernel learning. We also present bounds for sparse kernel learning formulations within the TS-MKL framework.

Submitted 2 February, 2013; originally announced February 2013.

arXiv:1210.5840 [pdf, other]

Supervised Learning with Similarity Functions

Authors: Purushottam Kar, Prateek Jain

Abstract: We address the problem of general supervised learning when data can only be accessed through an (indefinite) similarity function between data points. Existing work on learning with indefinite kernels has concentrated solely on binary/multi-class classification problems. We propose a model that is generic enough to handle any supervised learning task and also subsumes the model previously proposed for classification. We give a "goodness" criterion for similarity functions w.r.t. a given supervised learning task and then adapt a well-known landmarking technique to provide efficient algorithms for supervised learning using "good" similarity functions. We demonstrate the effectiveness of our model on three important supervised learning problems: a) real-valued regression, b) ordinal regression and c) ranking where we show that our method guarantees bounded generalization error. Furthermore, for the case of real-valued regression, we give a natural goodness definition that, when used in conjunction with a recent result in sparse vector recovery, guarantees a sparse predictor with bounded generalization error. Finally, we report results of our learning algorithms on regression and ordinal regression tasks using non-PSD similarity functions and demonstrate the effectiveness of our algorithms, especially that of the sparse landmark selection algorithm that achieves significantly higher accuracies than the baseline methods while offering reduced computational costs.

Submitted 22 October, 2012; originally announced October 2012.

Comments: To appear in the proceedings of NIPS 2012, 30 pages

arXiv:1201.6530 [pdf, other]

Random Feature Maps for Dot Product Kernels

Authors: Purushottam Kar, Harish Karnick

Abstract: Approximating non-linear kernels using feature maps has gained a lot of interest in recent years due to applications in reducing training and testing times of SVM classifiers and other kernel based learning algorithms. We extend this line of work and present low distortion embeddings for dot product kernels into linear Euclidean spaces. We base our results on a classical result in harmonic analysis characterizing all dot product kernels and use it to define randomized feature maps into explicit low dimensional Euclidean spaces in which the native dot product provides an approximation to the dot product kernel with high confidence.

Submitted 26 March, 2012; v1 submitted 31 January, 2012; originally announced January 2012.

Comments: To appear in the proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS 2012). This version corrects a minor error with Lemma 10. Acknowledgements: Devanshu Bhimwal

Journal ref: Journal of Machine Learning Research, W&CP 22 (2012) 583-591

arXiv:1112.5404 [pdf, ps, other]

Similarity-based Learning via Data Driven Embeddings

Authors: Purushottam Kar, Prateek Jain

Abstract: We consider the problem of classification using similarity/distance functions over data. Specifically, we propose a framework for defining the goodness of a (dis)similarity function with respect to a given learning task and propose algorithms that have guaranteed generalization properties when working with such good functions. Our framework unifies and generalizes the frameworks proposed by [Balcan-Blum ICML 2006] and [Wang et al ICML 2007]. An attractive feature of our framework is its adaptability to data - we do not promote a fixed notion of goodness but rather let data dictate it. We show, by giving theoretical guarantees that the goodness criterion best suited to a problem can itself be learned which makes our approach applicable to a variety of domains and problems. We propose a landmarking-based approach to obtaining a classifier from such learned goodness criteria. We then provide a novel diversity based heuristic to perform task-driven selection of landmark points instead of random selection. We demonstrate the effectiveness of our goodness criteria learning method as well as the landmark selection heuristic on a variety of similarity-based learning datasets and benchmark UCI datasets on which our method consistently outperforms existing approaches by a significant margin.

Submitted 22 December, 2011; originally announced December 2011.

Comments: To appear in the proceedings of NIPS 2011, 14 pages

arXiv:1010.3812 [pdf, ps, other]

Random Projection Trees Revisited

Authors: Aman Dhesi, Purushottam Kar

Abstract: The Random Projection Tree structures proposed in [Freund-Dasgupta STOC08] are space partitioning data structures that automatically adapt to various notions of intrinsic dimensionality of data. We prove new results for both the RPTreeMax and the RPTreeMean data structures. Our result for RPTreeMax gives a near-optimal bound on the number of levels required by this data structure to reduce the size of its cells by a factor $s \geq 2$. We also prove a packing lemma for this data structure. Our final result shows that low-dimensional manifolds have bounded Local Covariance Dimension. As a consequence we show that RPTreeMean adapts to manifold dimension as well.

Submitted 20 October, 2010; v1 submitted 19 October, 2010; originally announced October 2010.

Comments: Accepted for publication at NIPS 2010. This version corrects an incorrect usage of the term Assouad dimension. Acknowledgments: James Lee

Showing 1–27 of 27 results for author: Kar, P