Search | arXiv e-print repository

Separations in the Representational Capabilities of Transformers and Recurrent Architectures

Authors: Satwik Bhattamishra, Michael Hahn, Phil Blunsom, Varun Kanade

Abstract: Transformer architectures have been widely adopted in foundation models. Due to their high inference costs, there is renewed interest in exploring the potential of efficient recurrent architectures (RNNs). In this paper, we analyze the differences in the representational capabilities of Transformers and RNNs across several tasks of practical relevance, including index lookup, nearest neighbor, recognizing bounded Dyck languages, and string equality. For the tasks considered, our results show separations based on the size of the model required for different architectures. For example, we show that a one-layer Transformer of logarithmic width can perform index lookup, whereas an RNN requires a hidden state of linear size. Conversely, while constant-size RNNs can recognize bounded Dyck languages, we show that one-layer Transformers require a linear size for this task. Furthermore, we show that two-layer Transformers of logarithmic size can perform decision tasks such as string equality or disjointness, whereas both one-layer Transformers and recurrent models require linear size for these tasks. We also show that a log-size two-layer Transformer can implement the nearest neighbor algorithm in its forward pass; on the other hand, recurrent models require linear size. Our constructions are based on the existence of $N$ nearly orthogonal vectors in $O(\log N)$ dimensional space and our lower bounds are based on reductions from communication complexity problems. We supplement our theoretical results with experiments that highlight the differences in the performance of these architectures on practical-size sequences.

Submitted 13 June, 2024; originally announced June 2024.

Comments: Preprint
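
The abstract above rests its constructions on the existence of $N$ nearly orthogonal vectors in $O(\log N)$ dimensions. The following is a minimal sketch of that fact, not the paper's construction: it draws random sign vectors and measures their worst pairwise cosine. The constant 8, the tolerance `EPS`, and all function names are assumptions chosen for illustration; the dimension formula follows from a standard Hoeffding and union-bound argument.

```python
import math
import random


def random_sign_vectors(n, dim, seed=0):
    """Draw n random {-1, +1}^dim vectors, each packed into a dim-bit integer."""
    rng = random.Random(seed)
    return [rng.getrandbits(dim) for _ in range(n)]


def max_abs_cosine(vectors, dim):
    """Largest |cosine similarity| over all distinct pairs of sign vectors.

    For sign vectors, <u, v> = dim - 2 * (number of coordinates that differ),
    and every vector has squared norm dim.
    """
    worst = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            disagreements = bin(vectors[i] ^ vectors[j]).count("1")
            cosine = (dim - 2 * disagreements) / dim
            worst = max(worst, abs(cosine))
    return worst


N = 1024    # number of vectors (illustrative)
EPS = 0.5   # tolerated |cosine| between distinct vectors (illustrative)
# Assumed constant 8: with dim = 8 ln(N) / EPS^2, a union bound over all pairs
# makes a violation of the EPS tolerance extremely unlikely.
DIM = math.ceil(8 * math.log(N) / EPS ** 2)

vectors = random_sign_vectors(N, DIM)
coherence = max_abs_cosine(vectors, DIM)
print(f"N={N}, dim={DIM}, max |cosine|={coherence:.3f}")
```

The dimension grows with $\log N$ rather than $N$, so far more than `DIM` vectors fit while staying pairwise close to orthogonal.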

arXiv:2310.03016 [pdf, other]

Understanding In-Context Learning in Transformers and LLMs by Learning to Learn Discrete Functions

Authors: Satwik Bhattamishra, Arkil Patel, Phil Blunsom, Varun Kanade

Abstract: In order to understand the in-context learning phenomenon, recent works have adopted a stylized experimental framework and demonstrated that Transformers can learn gradient-based learning algorithms for various classes of real-valued functions. However, the limitations of Transformers in implementing learning algorithms, and their ability to learn other forms of algorithms are not well understood. Additionally, the degree to which these capabilities are confined to attention-based models is unclear. Furthermore, it remains to be seen whether the insights derived from these stylized settings can be extrapolated to pretrained Large Language Models (LLMs). In this work, we take a step towards answering these questions by demonstrating the following: (a) On a test-bed with a variety of Boolean function classes, we find that Transformers can nearly match the optimal learning algorithm for 'simpler' tasks, while their performance deteriorates on more 'complex' tasks. Additionally, we find that certain attention-free models perform (almost) identically to Transformers on a range of tasks. (b) When provided a teaching sequence, i.e. a set of examples that uniquely identifies a function in a class, we show that Transformers learn more sample-efficiently. Interestingly, our results show that Transformers can learn to implement two distinct algorithms to solve a single task, and can adaptively select the more sample-efficient algorithm depending on the sequence of in-context examples. (c) Lastly, we show that extant LLMs, e.g. LLaMA-2, GPT-4, can compete with nearest-neighbor baselines on prediction tasks that are guaranteed to not be in their training set.

Submitted 4 October, 2023; originally announced October 2023.

Comments: Preprint

arXiv:2211.12316 [pdf, other]

Simplicity Bias in Transformers and their Ability to Learn Sparse Boolean Functions

Authors: Satwik Bhattamishra, Arkil Patel, Varun Kanade, Phil Blunsom

Abstract: Despite the widespread success of Transformers on NLP tasks, recent works have found that they struggle to model several formal languages when compared to recurrent models. This raises the question of why Transformers perform well in practice and whether they have any properties that enable them to generalize better than recurrent models. In this work, we conduct an extensive empirical study on Boolean functions to demonstrate the following: (i) Random Transformers are relatively more biased towards functions of low sensitivity. (ii) When trained on Boolean functions, both Transformers and LSTMs prioritize learning functions of low sensitivity, with Transformers ultimately converging to functions of lower sensitivity. (iii) On sparse Boolean functions which have low sensitivity, we find that Transformers generalize near perfectly even in the presence of noisy labels whereas LSTMs overfit and achieve poor generalization accuracy. Overall, our results provide strong quantifiable evidence that suggests differences in the inductive biases of Transformers and recurrent models which may help explain Transformers' effective generalization performance despite relatively limited expressiveness.

Submitted 10 July, 2023; v1 submitted 22 November, 2022; originally announced November 2022.

Comments: ACL 2023
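
The abstract above turns on the sensitivity of Boolean functions. As a hedged illustration of that notion (not code from the paper), the sketch below computes average sensitivity by exhaustive enumeration: for each input, it counts the coordinates whose flip changes the output, then averages over all inputs. The function names and the choice of 8 input bits are assumptions made for the example.

```python
from itertools import product


def average_sensitivity(f, n):
    """Average, over all x in {0,1}^n, of the number of single-bit flips that change f(x)."""
    total = 0
    for x in product((0, 1), repeat=n):
        fx = f(x)
        for i in range(n):
            flipped = x[:i] + (1 - x[i],) + x[i + 1:]
            if f(flipped) != fx:
                total += 1
    return total / 2 ** n


def sparse_parity(x):
    """Parity of three fixed coordinates (an illustrative sparse function)."""
    return x[0] ^ x[2] ^ x[5]


def full_parity(x):
    """Parity of every coordinate, the most sensitive Boolean function."""
    return sum(x) % 2


N_BITS = 8
print(average_sensitivity(sparse_parity, N_BITS))  # depends on 3 bits only
print(average_sensitivity(full_parity, N_BITS))    # every flip changes the output
```

A parity over $k$ relevant bits has average sensitivity exactly $k$, so the sparse function scores 3 and the full parity scores 8 on eight inputs.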

arXiv:2210.06089 [pdf, other]

When are Local Queries Useful for Robust Learning?

Authors: Pascale Gourdeau, Varun Kanade, Marta Kwiatkowska, James Worrell

Abstract: Distributional assumptions have been shown to be necessary for the robust learnability of concept classes when considering the exact-in-the-ball robust risk and access to random examples by Gourdeau et al. (2019). In this paper, we study learning models where the learner is given more power through the use of local queries, and give the first distribution-free algorithms that perform robust empirical risk minimization (ERM) for this notion of robustness. The first learning model we consider uses local membership queries (LMQ), where the learner can query the label of points near the training sample. We show that, under the uniform distribution, LMQs do not increase the robustness threshold of conjunctions and any superclass, e.g., decision lists and halfspaces. Faced with this negative result, we introduce the local equivalence query ($\mathsf{LEQ}$) oracle, which returns whether the hypothesis and target concept agree in the perturbation region around a point in the training sample, as well as a counterexample if it exists. We show a separation result: on the one hand, if the query radius $λ$ is strictly smaller than the adversary's perturbation budget $ρ$, then distribution-free robust learning is impossible for a wide variety of concept classes; on the other hand, the setting $λ=ρ$ allows us to develop robust ERM algorithms. We then bound the query complexity of these algorithms based on online learning guarantees and further improve these bounds for the special case of conjunctions. We finish by giving robust learning algorithms for halfspaces on $\{0,1\}^n$ and then obtaining robustness guarantees for halfspaces in $\mathbb{R}^n$ against precision-bounded adversaries.

Submitted 20 July, 2023; v1 submitted 12 October, 2022; originally announced October 2022.

Comments: Accepted to NeurIPS 2022; V2 contains new results (Section 3.6) and an erratum from the previous version (Appendix C)
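
The abstract above describes a local equivalence query oracle that reports whether hypothesis and target agree throughout the perturbation region around a point, and returns a counterexample otherwise. A minimal sketch of such an oracle on $\{0,1\}^n$ follows, assuming the perturbation region is a Hamming ball and that the target can be evaluated directly (in the learning model it is hidden behind the oracle). All names and the example concepts are hypothetical.

```python
from itertools import combinations


def hamming_ball(point, radius):
    """Yield every point of {0,1}^n within Hamming distance `radius` of `point`."""
    n = len(point)
    for size in range(radius + 1):
        for positions in combinations(range(n), size):
            perturbed = list(point)
            for i in positions:
                perturbed[i] = 1 - perturbed[i]
            yield tuple(perturbed)


def local_equivalence_query(hypothesis, target, point, radius):
    """Return (True, None) if hypothesis and target agree on the whole ball,
    otherwise (False, z) for the first disagreement z found."""
    for z in hamming_ball(point, radius):
        if hypothesis(z) != target(z):
            return False, z
    return True, None


def target(x):
    """Illustrative target concept: the conjunction x0 AND x1."""
    return bool(x[0] and x[1])


def hypothesis(x):
    """Illustrative hypothesis that misses the second literal."""
    return bool(x[0])


sample_point = (1, 1, 0, 0)
print(local_equivalence_query(hypothesis, target, sample_point, radius=0))
print(local_equivalence_query(hypothesis, target, sample_point, radius=1))
```

With radius 0 the two concepts agree at the sample point itself, while radius 1 exposes the perturbation `(1, 0, 0, 0)` on which they differ, which is the kind of counterexample a robust ERM learner could use to refine its hypothesis.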

arXiv:2208.12063 [pdf, other]

Partial Matrix Completion

Authors: Elad Hazan, Adam Tauman Kalai, Varun Kanade, Clara Mohri, Y. Jennifer Sun

Abstract: The matrix completion problem aims to reconstruct a low-rank matrix based on a revealed set of possibly noisy entries. Prior works consider completing the entire matrix with generalization error guarantees. However, the completion accuracy can be drastically different over different entries. This work establishes a new framework of partial matrix completion, where the goal is to identify a large subset of the entries that can be completed with high confidence. We propose an efficient algorithm with the following provable guarantees. Given access to samples from an unknown and arbitrary distribution, it guarantees: (a) high accuracy over completed entries, and (b) high coverage of the underlying distribution. We also consider an online learning variant of this problem, where we propose a low-regret algorithm based on iterative gradient updates. Preliminary empirical evaluations are included.

Submitted 17 December, 2023; v1 submitted 25 August, 2022; originally announced August 2022.

Comments: NeurIPS 2023

arXiv:2205.12327 [pdf, other]

Beyond Impossibility: Balancing Sufficiency, Separation and Accuracy

Authors: Limor Gultchin, Vincent Cohen-Addad, Sophie Giffard-Roisin, Varun Kanade, Frederik Mallmann-Trenn

Abstract: Among the various aspects of algorithmic fairness studied in recent years, the tension between satisfying both sufficiency and separation -- e.g. the ratios of positive or negative predictive values, and false positive or false negative rates across groups -- has received much attention. Following a debate sparked by COMPAS, a criminal justice predictive system, the academic community has responded by laying out important theoretical understanding, showing that one cannot achieve both with an imperfect predictor when there is no equal distribution of labels across the groups. In this paper, we shed more light on what might be still possible beyond the impossibility -- the existence of a trade-off means we should aim to find a good balance within it. After refining the existing theoretical result, we propose an objective that aims to balance sufficiency and separation measures, while maintaining similar accuracy levels. We show the use of such an objective in two empirical case studies, one involving a multi-objective framework, and the other fine-tuning of a model pre-trained for accuracy. We show promising results, where better trade-offs are achieved compared to existing alternatives.

Submitted 24 May, 2022; originally announced May 2022.

arXiv:2205.06127 [pdf, ps, other]

Sample Complexity Bounds for Robustly Learning Decision Lists against Evasion Attacks

Authors: Pascale Gourdeau, Varun Kanade, Marta Kwiatkowska, James Worrell

Abstract: A fundamental problem in adversarial machine learning is to quantify how much training data is needed in the presence of evasion attacks. In this paper we address this issue within the framework of PAC learning, focusing on the class of decision lists. Given that distributional assumptions are essential in the adversarial setting, we work with probability distributions on the input data that satisfy a Lipschitz condition: nearby points have similar probability. Our key results illustrate that the adversary's budget (that is, the number of bits it can perturb on each input) is a fundamental quantity in determining the sample complexity of robust learning. Our first main result is a sample-complexity lower bound: the class of monotone conjunctions (essentially the simplest non-trivial hypothesis class on the Boolean hypercube) and any superclass has sample complexity at least exponential in the adversary's budget. Our second main result is a corresponding upper bound: for every fixed $k$ the class of $k$-decision lists has polynomial sample complexity against a $\log(n)$-bounded adversary. This sheds further light on the question of whether an efficient PAC learning algorithm can always be used as an efficient $\log(n)$-robust learning algorithm under the uniform distribution.

Submitted 12 May, 2022; originally announced May 2022.

Comments: To appear in the proceedings of International Joint Conference on Artificial Intelligence (2022)

arXiv:2202.11461 [pdf, other]

Exponential Tail Local Rademacher Complexity Risk Bounds Without the Bernstein Condition

Authors: Varun Kanade, Patrick Rebeschini, Tomas Vaskevicius

Abstract: The local Rademacher complexity framework is one of the most successful general-purpose toolboxes for establishing sharp excess risk bounds for statistical estimators based on the framework of empirical risk minimization. Applying this toolbox typically requires using the Bernstein condition, which often restricts applicability to convex and proper settings. Recent years have witnessed several examples of problems where optimal statistical performance is only achievable via non-convex and improper estimators originating from aggregation theory, including the fundamental problem of model selection. These examples are currently outside of the reach of the classical localization theory. In this work, we build upon the recent approach to localization via offset Rademacher complexities, for which a general high-probability theory has yet to be established. Our main result is an exponential-tail excess risk bound expressed in terms of the offset Rademacher complexity that yields results at least as sharp as those obtainable via the classical theory. However, our bound applies under an estimator-dependent geometric condition (the "offset condition") instead of the estimator-independent (but, in general, distribution-dependent) Bernstein condition on which the classical theory relies. Our results apply to improper prediction regimes not directly covered by the classical theory.

Submitted 23 February, 2022; originally announced February 2022.

arXiv:2105.14119 [pdf, other]

Towards optimally abstaining from prediction with OOD test examples

Authors: Adam Tauman Kalai, Varun Kanade

Abstract: A common challenge across all areas of machine learning is that training data is not distributed like test data, due to natural shifts, "blind spots," or adversarial examples; such test examples are referred to as out-of-distribution (OOD) test examples. We consider a model where one may abstain from predicting, at a fixed cost. In particular, our transductive abstention algorithm takes labeled training examples and unlabeled test examples as input, and provides predictions with optimal prediction loss guarantees. The loss bounds match standard generalization bounds when test examples are i.i.d. from the training distribution, but add an additional term that is the cost of abstaining times the statistical distance between the train and test distribution (or the fraction of adversarial examples). For linear regression, we give a polynomial-time algorithm based on Celis-Dennis-Tapia optimization algorithms. For binary classification, we show how to efficiently implement it using a proper agnostic learner (i.e., an Empirical Risk Minimizer) for the class of interest. Our work builds on a recent abstention algorithm of Goldwasser, Kalais, and Montasser (2020) for transductive binary classification.

Submitted 27 October, 2021; v1 submitted 28 May, 2021; originally announced May 2021.

Comments: In NeurIPS 2021 (+spotlight), 24 pages

arXiv:2102.07802 [pdf, ps, other]

Efficient Learning with Arbitrary Covariate Shift

Authors: Adam Kalai, Varun Kanade

Abstract: We give an efficient algorithm for learning a binary function in a given class C of bounded VC dimension, with training data distributed according to P and test data according to Q, where P and Q may be arbitrary distributions over X. This is the generic form of what is called covariate shift, which is impossible in general as arbitrary P and Q may not even overlap. However, guarantees were recently given in a model called PQ-learning (Goldwasser et al., 2020) where the learner has: (a) access to unlabeled test examples from Q (in addition to labeled samples from P, i.e., semi-supervised learning); and (b) the option to reject any example and abstain from classifying it (i.e., selective classification). The algorithm of Goldwasser et al. (2020) requires an (agnostic) noise-tolerant learner for C. The present work gives a polynomial-time PQ-learning algorithm that uses an oracle to a "reliable" learner for C, where reliable learning (Kalai et al., 2012) is a model of learning with one-sided noise. Furthermore, our reduction is optimal in the sense that we show the equivalence of reliable and PQ learning.

Submitted 15 February, 2021; originally announced February 2021.

arXiv:2007.08243 [pdf, ps, other]

Lottery Tickets in Linear Models: An Analysis of Iterative Magnitude Pruning

Authors: Bryn Elesedy, Varun Kanade, Yee Whye Teh

Abstract: We analyse the pruning procedure behind the lottery ticket hypothesis (arXiv:1803.03635v5), iterative magnitude pruning (IMP), when applied to linear models trained by gradient flow. We begin by presenting sufficient conditions on the statistical structure of the features under which IMP prunes those features that have smallest projection onto the data. Following this, we explore IMP as a method for sparse estimation.

Submitted 5 July, 2021; v1 submitted 16 July, 2020; originally announced July 2020.

Comments: Updated for Sparsity in Neural Networks Workshop

ACM Class: I.5.1

arXiv:2007.04028 [pdf, other]

How benign is benign overfitting?

Authors: Amartya Sanyal, Puneet K Dokania, Varun Kanade, Philip H. S. Torr

Abstract: We investigate two causes for adversarial vulnerability in deep neural networks: bad data and (poorly) trained models. When trained with SGD, deep neural networks essentially achieve zero training error, even in the presence of label noise, while also exhibiting good generalization on natural test data, something referred to as benign overfitting [2, 10]. However, these models are vulnerable to adversarial attacks. We identify label noise as one of the causes for adversarial vulnerability, and provide theoretical and empirical evidence in support of this. Surprisingly, we find several instances of label noise in datasets such as MNIST and CIFAR, and that robustly trained models incur training error on some of these, i.e. they don't fit the noise. However, removing noisy labels alone does not suffice to achieve adversarial robustness. Standard training procedures bias neural networks towards learning "simple" classification boundaries, which may be less robust than more complex ones. We observe that adversarial training does produce more complex decision boundaries. We conjecture that in part the need for complex decision boundaries arises from sub-optimal representation learning. By means of simple toy examples, we show theoretically how the choice of representation can drastically affect adversarial robustness.

Submitted 8 July, 2020; originally announced July 2020.

arXiv:2003.01461 [pdf, other]

Differentiable Causal Backdoor Discovery

Authors: Limor Gultchin, Matt J. Kusner, Varun Kanade, Ricardo Silva

Abstract: Discovering the causal effect of a decision is critical to nearly all forms of decision-making. In particular, it is a key quantity in drug development, in crafting government policy, and when implementing a real-world machine learning system. Given only observational data, confounders often obscure the true causal effect. Luckily, in some cases, it is possible to recover the causal effect by using certain observed variables to adjust for the effects of confounders. However, without access to the true causal model, finding this adjustment requires brute-force search. In this work, we present an algorithm that exploits auxiliary variables, similar to instruments, in order to find an appropriate adjustment by a gradient-based optimization method. We demonstrate that it outperforms practical alternatives in estimating the true causal effect, without knowledge of the full causal graph.

Submitted 3 March, 2020; originally announced March 2020.

Comments: Published in the Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy

arXiv:2002.00189 [pdf, other]

The Statistical Complexity of Early-Stopped Mirror Descent

Authors: Tomas Vaškevičius, Varun Kanade, Patrick Rebeschini

Abstract: Recently there has been a surge of interest in understanding implicit regularization properties of iterative gradient-based optimization algorithms. In this paper, we study the statistical guarantees on the excess risk achieved by early-stopped unconstrained mirror descent algorithms applied to the unregularized empirical risk with the squared loss for linear models and kernel methods. By completing an inequality that characterizes convexity for the squared loss, we identify an intrinsic link between offset Rademacher complexities and potential-based convergence analysis of mirror descent methods. Our observation immediately yields excess risk guarantees for the path traced by the iterates of mirror descent in terms of offset complexities of certain function classes depending only on the choice of the mirror map, initialization point, step-size, and the number of iterations. We apply our theory to recover, in a clean and elegant manner via rather short proofs, some of the recent results in the implicit regularization literature, while also showing how to improve upon them in some settings.

Submitted 27 August, 2020; v1 submitted 1 February, 2020; originally announced February 2020.

arXiv:1909.06861 [pdf, other]

Online k-means Clustering

Authors: Vincent Cohen-Addad, Benjamin Guedj, Varun Kanade, Guy Rom

Abstract: We study the problem of online clustering where a clustering algorithm has to assign a new point that arrives to one of $k$ clusters. The specific formulation we use is the $k$-means objective: At each time step the algorithm has to maintain a set of $k$ candidate centers and the loss incurred is the squared distance between the new point and the closest center. The goal is to minimize regret with respect to the best solution to the $k$-means objective ($\mathcal{C}$) in hindsight. We show that provided the data lies in a bounded region, an implementation of the Multiplicative Weights Update Algorithm (MWUA) using a discretized grid achieves a regret bound of $\tilde{O}(\sqrt{T})$ in expectation. We also present an online-to-offline reduction that shows that an efficient no-regret online algorithm (despite being allowed to choose a different set of candidate centers at each round) implies an offline efficient algorithm for the $k$-means problem. In light of this hardness, we consider the slightly weaker requirement of comparing regret with respect to $(1 + ε) \mathcal{C}$ and present a no-regret algorithm with runtime $O\left(T(\mathrm{poly}(\log(T),k,d,1/ε))^{k(d+O(1))}\right)$. Our algorithm is based on maintaining an incremental coreset and an adaptive variant of the MWUA. We show that naïve online algorithms, such as Follow The Leader, fail to produce sublinear regret in the worst case. We also report preliminary experiments with synthetic and real-world data.

Submitted 15 September, 2019; originally announced September 2019.

Comments: 11 pages, 1 figure

Journal ref: Proceedings of The 24th International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR 130:1126-1134, 2021
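
The abstract above mentions running the Multiplicative Weights Update Algorithm over a discretized grid of candidate centers. The sketch below is a one-dimensional toy version under stated assumptions, not the paper's algorithm: data lie in $[0, 1]$, each "expert" is a set of $k$ grid points, and the learning rate is the textbook Hedge choice for losses in $[0, 1]$. The grid size, data generator, and all names are hypothetical, and regret is measured against the best grid expert rather than the best unrestricted solution.

```python
import math
import random
from itertools import combinations


def kmeans_loss(point, centers):
    """Squared distance from the point to its closest center."""
    return min((point - c) ** 2 for c in centers)


def run_mwua(points, k, grid_size):
    """Hedge/MWUA over all k-subsets of a uniform grid on [0, 1].

    Returns (expected cumulative loss of MWUA, loss of the best expert in
    hindsight, number of experts).
    """
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    experts = list(combinations(grid, k))
    eta = math.sqrt(8 * math.log(len(experts)) / len(points))
    cumulative = [0.0] * len(experts)
    expected_loss = 0.0
    for x in points:
        # Weights are proportional to exp(-eta * past loss); subtracting the
        # minimum only rescales them and avoids numerical underflow.
        base = min(cumulative)
        weights = [math.exp(-eta * (c - base)) for c in cumulative]
        total = sum(weights)
        losses = [kmeans_loss(x, e) for e in experts]
        expected_loss += sum(w * l for w, l in zip(weights, losses)) / total
        cumulative = [c + l for c, l in zip(cumulative, losses)]
    return expected_loss, min(cumulative), len(experts)


rng = random.Random(0)
# Illustrative stream: two clusters near 0.2 and 0.8, clipped to [0, 1].
points = [min(1.0, max(0.0, rng.gauss(0.2 if t % 2 == 0 else 0.8, 0.05)))
          for t in range(500)]

mwua_loss, best_loss, n_experts = run_mwua(points, k=2, grid_size=11)
regret = mwua_loss - best_loss
regret_bound = math.sqrt(len(points) * math.log(n_experts) / 2)
print(f"regret={regret:.3f}, Hedge bound={regret_bound:.3f}")
```

Because every loss lies in $[0, 1]$, the standard Hedge analysis guarantees expected regret of at most $\sqrt{T \ln(n)/2}$ against the best of the $n$ grid experts, which is the $\sqrt{T}$ scaling the abstract refers to. The number of experts grows rapidly with grid resolution, $k$, and dimension, which is why this brute-force grid is only practical for toy instances.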

arXiv:1909.05822 [pdf, other]

On the Hardness of Robust Classification

Authors: Pascale Gourdeau, Varun Kanade, Marta Kwiatkowska, James Worrell

Abstract: It is becoming increasingly important to understand the vulnerability of machine learning models to adversarial attacks. In this paper we study the feasibility of robust learning from the perspective of computational learning theory, considering both sample and computational complexity. In particular, our definition of robust learnability requires polynomial sample complexity. We start with two negative results. We show that no non-trivial concept class can be robustly learned in the distribution-free setting against an adversary who can perturb just a single input bit. We show moreover that the class of monotone conjunctions cannot be robustly learned under the uniform distribution against an adversary who can perturb $ω(\log n)$ input bits. However if the adversary is restricted to perturbing $O(\log n)$ bits, then the class of monotone conjunctions can be robustly learned with respect to a general class of distributions (that includes the uniform distribution). Finally, we provide a simple proof of the computational hardness of robust learning on the Boolean hypercube. Unlike previous results of this nature, our result does not rely on another computational model (e.g. the statistical query model) nor on any hardness assumption other than the existence of a hard learning problem in the PAC framework.

Submitted 12 September, 2019; originally announced September 2019.

Comments: To appear in the proceedings of Neural Information Processing Systems Conference (2019)

arXiv:1909.05122 [pdf, other]

Implicit Regularization for Optimal Sparse Recovery

Authors: Tomas Vaškevičius, Varun Kanade, Patrick Rebeschini

Abstract: We investigate implicit regularization schemes for gradient descent methods applied to unpenalized least squares regression to solve the problem of reconstructing a sparse signal from an underdetermined system of linear measurements under the restricted isometry assumption. For a given parametrization yielding a non-convex optimization problem, we show that prescribed choices of initialization, step size and stopping time yield a statistically and computationally optimal algorithm that achieves the minimax rate with the same cost required to read the data up to poly-logarithmic factors. Beyond minimax optimality, we show that our algorithm adapts to instance difficulty and yields a dimension-independent rate when the signal-to-noise ratio is high enough. Key to the computational efficiency of our method is an increasing step size scheme that adapts to refined estimates of the true solution. We validate our findings with numerical experiments and compare our algorithm against explicit $\ell_{1}$ penalization. Going from hard instances to easy ones, our algorithm is seen to undergo a phase transition, eventually matching least squares with an oracle knowledge of the true support.

Submitted 11 September, 2019; originally announced September 2019.

Comments: To appear in NeurIPS 2019

arXiv:1905.11566 [pdf, other]

Adaptive Reduced Rank Regression

Authors: Qiong Wu, Felix Ming Fai Wong, Zhenming Liu, Yanhua Li, Varun Kanade

Abstract: We study the low rank regression problem $\mathbf{y} = M\mathbf{x} + ε$, where $\mathbf{x}$ and $\mathbf{y}$ are $d_1$ and $d_2$ dimensional vectors respectively. We consider the extreme high-dimensional setting where the number of observations $n$ is less than $d_1 + d_2$. Existing algorithms are designed for settings where $n$ is typically as large as $\mathrm{rank}(M)(d_1+d_2)$. This work provides an efficient algorithm which only involves two SVD, and establishes statistical guarantees on its performance. The algorithm decouples the problem by first estimating the precision matrix of the features, and then solving the matrix denoising problem. To complement the upper bound, we introduce new techniques for establishing lower bounds on the performance of any algorithm for this problem. Our preliminary experiments confirm that our algorithm often out-performs existing baselines, and is always at least competitive.

Submitted 23 October, 2020; v1 submitted 27 May, 2019; originally announced May 2019.

Comments: 36 pages

arXiv:1810.04468 [pdf, other]

Decentralized Cooperative Stochastic Bandits

Authors: David Martínez-Rubio, Varun Kanade, Patrick Rebeschini

Abstract: We study a decentralized cooperative stochastic multi-armed bandit problem with $K$ arms on a network of $N$ agents. In our model, the reward distribution of each arm is the same for each agent and rewards are drawn independently across agents and time steps. In each round, each agent chooses an arm to play and subsequently sends a message to her neighbors. The goal is to minimize the overall regret of the entire network. We design a fully decentralized algorithm that uses an accelerated consensus procedure to compute (delayed) estimates of the average of rewards obtained by all the agents for each arm, and then uses an upper confidence bound (UCB) algorithm that accounts for the delay and error of the estimates. We analyze the regret of our algorithm and also provide a lower bound. The regret is bounded by the optimal centralized regret plus a natural and simple term depending on the spectral gap of the communication matrix. Our algorithm is simpler to analyze than those proposed in prior work and it achieves better regret bounds, while requiring less information about the underlying network. It also performs better empirically.

Submitted 24 October, 2019; v1 submitted 10 October, 2018; originally announced October 2018.

arXiv:1808.01857 [pdf, other]

Statistical Windows in Testing for the Initial Distribution of a Reversible Markov Chain

Authors: Quentin Berthet, Varun Kanade

Abstract: We study the problem of hypothesis testing between two discrete distributions, where we only have access to samples after the action of a known reversible Markov chain, playing the role of noise. We derive instance-dependent minimax rates for the sample complexity of this problem, and show how its dependence in time is related to the spectral properties of the Markov chain. We show that there exists a wide statistical window, in terms of sample complexity for hypothesis testing between different pairs of initial distributions. We illustrate these results in several concrete examples.

Submitted 6 August, 2018; originally announced August 2018.

MSC Class: 62C20

arXiv:1806.03461 [pdf, other]

TAPAS: Tricks to Accelerate (encrypted) Prediction As a Service

Authors: Amartya Sanyal, Matt J. Kusner, Adrià Gascón, Varun Kanade

Abstract: Machine learning methods are widely used for a variety of prediction problems. \emph{Prediction as a service} is a paradigm in which service providers with technological expertise and computational resources may perform predictions for clients. However, data privacy severely restricts the applicability of such services, unless measures to keep client data private (even from the service provider) are designed. Equally important is to minimize the amount of computation and communication required between client and server. Fully homomorphic encryption offers a possible way out, whereby clients may encrypt their data, and on which the server may perform arithmetic computations. The main drawback of using fully homomorphic encryption is the amount of time required to evaluate large machine learning models on encrypted data. We combine ideas from the machine learning literature, particularly work on binarization and sparsification of neural networks, together with algorithmic tools to speed-up and parallelize computation using encrypted data.

Submitted 9 June, 2018; originally announced June 2018.

Comments: Accepted at International Conference in Machine Learning (ICML), 2018

arXiv:1804.07090 [pdf, other]

Robustness via Deep Low-Rank Representations

Authors: Amartya Sanyal, Varun Kanade, Philip H. S. Torr, Puneet K. Dokania

Abstract: We investigate the effect of the dimensionality of the representations learned in Deep Neural Networks (DNNs) on their robustness to input perturbations, both adversarial and random. To achieve low dimensionality of learned representations, we propose an easy-to-use, end-to-end trainable, low-rank regularizer (LR) that can be applied to any intermediate layer representation of a DNN. This regularizer forces the feature representations to (mostly) lie in a low-dimensional linear subspace. We perform a wide range of experiments that demonstrate that the LR indeed induces low rank on the representations, while providing modest improvements to accuracy as an added benefit. Furthermore, the learned features make the trained model significantly more robust to input perturbations such as Gaussian and adversarial noise (even without adversarial training). Lastly, the low-dimensionality means that the learned features are highly compressible; thus discriminative features of the data can be stored using very little memory. Our experiments indicate that models trained using the LR learn robust classifiers by discovering subspaces that avoid non-robust features. Algorithmically, the LR is scalable, generic, and straightforward to implement into existing deep learning frameworks.

Submitted 19 February, 2020; v1 submitted 19 April, 2018; originally announced April 2018.

arXiv:1802.05690 [pdf, ps, other]

Learning DNFs under product distributions via μ-biased quantum Fourier sampling

Authors: Varun Kanade, Andrea Rocchetto, Simone Severini

Abstract: We show that DNF formulae can be quantum PAC-learned in polynomial time under product distributions using a quantum example oracle. The best classical algorithm (without access to membership queries) runs in superpolynomial time. Our result extends the work by Bshouty and Jackson (1998) that proved that DNF formulae are efficiently learnable under the uniform distribution using a quantum example oracle. Our proof is based on a new quantum algorithm that efficiently samples the coefficients of a μ-biased Fourier transform.

Submitted 25 November, 2019; v1 submitted 15 February, 2018; originally announced February 2018.

Comments: 17 pages; v3 based on journal version; minor corrections and clarifications

Journal ref: Quantum Information and Computation, Vol. 19, No. 15&16 (2019) 1261-1278

arXiv:1711.00982 [pdf, other]

From which world is your graph?

Authors: Cheng Li, Felix Wong, Zhenming Liu, Varun Kanade

Abstract: Discovering statistical structure from links is a fundamental problem in the analysis of social networks. Choosing a misspecified model, or equivalently, an incorrect inference algorithm will result in an invalid analysis or even falsely uncover patterns that are in fact artifacts of the model. This work focuses on unifying two of the most widely used link-formation models: the stochastic blockmodel (SBM) and the small world (or latent space) model (SWM). Integrating techniques from kernel learning, spectral graph theory, and nonlinear dimensionality reduction, we develop the first statistically sound polynomial-time algorithm to discover latent patterns in sparse graphs for both models. When the network comes from an SBM, the algorithm outputs a block structure. When it is from an SWM, the algorithm outputs estimates of each node's latent position.

Submitted 2 November, 2017; originally announced November 2017.

Comments: To appear in NIPS 2017

arXiv:1704.02147 [pdf, ps, other]

Hierarchical Clustering: Objective Functions and Algorithms

Authors: Vincent Cohen-Addad, Varun Kanade, Frederik Mallmann-Trenn, Claire Mathieu

Abstract: Hierarchical clustering is a recursive partitioning of a dataset into clusters at an increasingly finer granularity. Motivated by the fact that most work on hierarchical clustering was based on providing algorithms, rather than optimizing a specific objective, Dasgupta framed similarity-based hierarchical clustering as a combinatorial optimization problem, where a `good' hierarchical clustering is one that minimizes some cost function. He showed that this cost function has certain desirable properties. We take an axiomatic approach to defining `good' objective functions for both similarity and dissimilarity-based hierarchical clustering. We characterize a set of "admissible" objective functions (that includes Dasgupta's one) that have the property that when the input admits a `natural' hierarchical clustering, it has an optimal value. Equipped with a suitable objective function, we analyze the performance of practical algorithms, as well as develop better algorithms. For similarity-based hierarchical clustering, Dasgupta showed that the divisive sparsest-cut approach achieves an $O(\log^{3/2} n)$-approximation. We give a refined analysis of the algorithm and show that it in fact achieves an $O(\sqrt{\log n})$-approx. (Charikar and Chatziafratis independently proved that it is a $O(\sqrt{\log n})$-approx.). This improves upon the LP-based $O(\log n)$-approx. of Roy and Pokutta.
For dissimilarity-based hierarchical clustering, we show that the classic average-linkage algorithm gives a factor 2 approx., and provide a simple and better algorithm that gives a factor 3/2 approx. Finally, we consider `beyond-worst-case' scenario through a generalisation of the stochastic block model for hierarchical clustering. We show that Dasgupta's cost function has desirable properties for these inputs and we provide a simple 1 + o(1)-approximation in this setting.

Submitted 7 April, 2017; originally announced April 2017.

arXiv:1702.03959 [pdf, other]

How large is your graph?

Authors: Varun Kanade, Frederik Mallmann-Trenn, Victor Verdugo

Abstract: We consider the problem of estimating the graph size, where one is given only local access to the graph. We formally define a query model in which one starts with a \emph{seed} node and is allowed to make queries about neighbours of nodes that have already been seen. In the case of undirected graphs, an estimator of Katzir et al. (2014) based on a sample from the stationary distribution $π$ uses $O\left(\frac{1}{\|π\|_2} + d_{\text{avg}}\right)$ queries, we prove that this is tight. In addition, we establish this as a lower bound even when the algorithm is allowed to crawl the graph arbitrarily, the results of Katzir et al. give an upper bound that is worse by a multiplicative factor $t_\text{mix} \cdot \log(n)$. The picture becomes significantly different in the case of directed graphs. We show that without strong assumptions on the graph structure, the number of nodes cannot be predicted to within a constant multiplicative factor without using a number of queries that are at least linear in the number of nodes, in particular, rapid mixing and small diameter, properties that most real-world networks exhibit, do not suffice. The question of interest is whether any algorithm can beat breadth-first search. We introduce a new parameter, generalising the well-studied conductance, such that if a suitable bound on it exists and is known to the algorithm, the number of queries required is sublinear in the number of edges, we show that this is tight.

Submitted 13 February, 2017; originally announced February 2017.

arXiv:1611.10258 [pdf, ps, other]

Reliably Learning the ReLU in Polynomial Time

Authors: Surbhi Goel, Varun Kanade, Adam Klivans, Justin Thaler

Abstract: We give the first dimension-efficient algorithms for learning Rectified Linear Units (ReLUs), which are functions of the form $\mathbf{x} \mapsto \max(0, \mathbf{w} \cdot \mathbf{x})$ with $\mathbf{w} \in \mathbb{S}^{n-1}$. Our algorithm works in the challenging Reliable Agnostic learning model of Kalai, Kanade, and Mansour (2009) where the learner is given access to a distribution $\mathcal{D}$ on labeled examples but the labeling may be arbitrary. We construct a hypothesis that simultaneously minimizes the false-positive rate and the loss on inputs given positive labels by $\mathcal{D}$, for any convex, bounded, and Lipschitz loss function. The algorithm runs in polynomial-time (in $n$) with respect to any distribution on $\mathbb{S}^{n-1}$ (the unit sphere in $n$ dimensions) and for any error parameter $ε = Ω(1/\log n)$ (this yields a PTAS for a question raised by F. Bach on the complexity of maximizing ReLUs). These results are in contrast to known efficient algorithms for reliably learning linear threshold functions, where $ε$ must be $Ω(1)$ and strong assumptions are required on the marginal distribution. We can compose our results to obtain the first set of efficient algorithms for learning constant-depth networks of ReLUs. Our techniques combine kernel methods and polynomial approximations with a "dual-loss" approach to convex programming.
As a byproduct we obtain a number of applications including the first set of efficient algorithms for "convex piecewise-linear fitting" and the first efficient algorithms for noisy polynomial reconstruction of low-weight polynomials on the unit sphere.

Submitted 30 November, 2016; originally announced November 2016.

arXiv:1611.02460 [pdf, other]

On coalescence time in graphs--When is coalescing as fast as meeting?

Authors: Varun Kanade, Frederik Mallmann-Trenn, Thomas Sauerwald

Abstract: Coalescing random walks is a fundamental stochastic process, where a set of particles perform independent discrete-time random walks on an undirected graph. Whenever two or more particles meet at a given node, they merge and continue as a single random walk. The coalescence time is defined as the expected time until only one particle remains, starting from one particle at every node. Despite recent progress the coalescence time for graphs such as binary trees, d-dimensional tori, hypercubes and more generally, vertex-transitive graphs, remains unresolved. We provide a powerful toolkit that results in tight bounds for various topologies including the aforementioned ones. The meeting time is defined as the worst-case expected time required for two random walks to arrive at the same node at the same time. As a general result, we establish that for graphs whose meeting time is only marginally larger than the mixing time (a factor of log^2 n), the coalescence time of n random walks equals the meeting time up to constant factors. This upper bound is complemented by the construction of a graph family demonstrating that this result is the best possible up to constant factors. For almost-regular graphs, we bound the coalescence time by the hitting time, resolving the discrete-time variant of a conjecture by Aldous for this class of graphs. Finally, we prove that for any graph the coalescence time is bounded by O(n^3) (which is tight for the Barbell graph); surprisingly even such a basic question about the coalescing time was not answered before this work.
By duality, our results give bounds on the voter model and therefore give bounds on the consensus time in arbitrary undirected graphs. We also establish a new bound on the hitting time and cover time of regular graphs, improving and tightening previous results by Broder and Karlin, as well as those by Aldous and Fill.

Submitted 1 November, 2018; v1 submitted 8 November, 2016; originally announced November 2016.

arXiv:1604.01999 [pdf, other]

Online Optimization of Smoothed Piecewise Constant Functions

Authors: Vincent Cohen-Addad, Varun Kanade

Abstract: We study online optimization of smoothed piecewise constant functions over the domain [0, 1). This is motivated by the problem of adaptively picking parameters of learning algorithms as in the recently introduced framework by Gupta and Roughgarden (2016). Majority of the machine learning literature has focused on Lipschitz-continuous functions or functions with bounded gradients. This is with good reason---any learning algorithm suffers linear regret even against piecewise constant functions that are chosen adversarially, arguably the simplest of non-Lipschitz continuous functions. The smoothed setting we consider is inspired by the seminal work of Spielman and Teng (2004) and the recent work of Gupta and Roughgarden---in this setting, the sequence of functions may be chosen by an adversary, however, with some uncertainty in the location of discontinuities. We give algorithms that achieve sublinear regret in the full information and bandit settings.

Submitted 20 May, 2016; v1 submitted 7 April, 2016; originally announced April 2016.

arXiv:1509.01988 [pdf, ps, other]

Stable Matching with Evolving Preferences

Authors: Varun Kanade, Nikos Leonardos, Frédéric Magniez

Abstract: We consider the problem of stable matching with dynamic preference lists. At each time step, the preference list of some player may change by swapping random adjacent members. The goal of a central agency (algorithm) is to maintain an approximately stable matching (in terms of number of blocking pairs) at all times. The changes in the preference lists are not reported to the algorithm, but must instead be probed explicitly by the algorithm. We design an algorithm that in expectation and with high probability maintains a matching that has at most $O((\log n)^2)$ blocking pairs.

Submitted 28 June, 2016; v1 submitted 7 September, 2015; originally announced September 2015.

Comments: 13 pages

arXiv:1505.05215 [pdf, ps, other]

Learning with a Drifting Target Concept

Authors: Steve Hanneke, Varun Kanade, Liu Yang

Abstract: We study the problem of learning in the presence of a drifting target concept. Specifically, we provide bounds on the error rate at a given time, given a learner with access to a history of independent samples labeled according to a target concept that can change on each round. One of our main contributions is a refinement of the best previous results for polynomial-time algorithms for the space of linear separators under a uniform distribution. We also provide general results for an algorithm capable of adapting to a variable rate of drift of the target concept. Some of the results also describe an active learning variant of this setting, and provide bounds on the number of queries for the labels of points in the sequence sufficient to obtain the stated bounds on the error rates.

Submitted 19 May, 2015; originally announced May 2015.

arXiv:1404.6325 [pdf, other]

Global and Local Information in Clustering Labeled Block Models

Authors: Varun Kanade, Elchanan Mossel, Tselil Schramm

Abstract: The stochastic block model is a classical cluster-exhibiting random graph model that has been widely studied in statistics, physics and computer science. In its simplest form, the model is a random graph with two equal-sized clusters, with intra-cluster edge probability p, and inter-cluster edge probability q. We focus on the sparse case, i.e., p, q = O(1/n), which is practically more relevant and also mathematically more challenging. A conjecture of Decelle, Krzakala, Moore and Zdeborova, based on ideas from statistical physics, predicted a specific threshold for clustering. The negative direction of the conjecture was proved by Mossel, Neeman and Sly (2012), and more recently the positive direction was proven independently by Massoulie and Mossel, Neeman, and Sly. In many real network clustering problems, nodes contain information as well. We study the interplay between node and network information in clustering by studying a labeled block model, where in addition to the edge information, the true cluster labels of a small fraction of the nodes are revealed. In the case of two clusters, we show that below the threshold, a small amount of node information does not affect recovery. On the other hand, we show that for any small amount of information efficient local clustering is achievable as long as the number of clusters is sufficiently large (as a function of the amount of revealed information).

Submitted 3 July, 2014; v1 submitted 25 April, 2014; originally announced April 2014.

Comments: 24 pages, 2 figures. A short abstract describing these results will appear in proceedings of RANDOM 2014

arXiv:1402.5164 [pdf, ps, other]

Distribution-Independent Reliable Learning

Authors: Varun Kanade, Justin Thaler

Abstract: We study several questions in the reliable agnostic learning framework of Kalai et al. (2009), which captures learning tasks in which one type of error is costlier than others. A positive reliable classifier is one that makes no false positive errors. The goal in the positive reliable agnostic framework is to output a hypothesis with the following properties: (i) its false positive error rate is at most $ε$, (ii) its false negative error rate is at most $ε$ more than that of the best positive reliable classifier from the class. A closely related notion is fully reliable agnostic learning, which considers partial classifiers that are allowed to predict "unknown" on some inputs. The best fully reliable partial classifier is one that makes no errors and minimizes the probability of predicting "unknown", and the goal in fully reliable learning is to output a hypothesis that is almost as good as the best fully reliable partial classifier from a class. For distribution-independent learning, the best known algorithms for PAC learning typically utilize polynomial threshold representations, while the state of the art agnostic learning algorithms use point-wise polynomial approximations. We show that one-sided polynomial approximations, an intermediate notion between polynomial threshold representations and point-wise polynomial approximations, suffice for learning in the reliable agnostic settings.
We then show that majorities can be fully reliably learned and disjunctions of majorities can be positive reliably learned, through constructions of appropriate one-sided polynomial approximations. Our fully reliable algorithm for majorities provides the first evidence that fully reliable learning may be strictly easier than agnostic learning. Our algorithms also satisfy strong attribute-efficiency properties, and provide smooth tradeoffs between sample complexity and running time.

Submitted 20 February, 2014; originally announced February 2014.

Comments: 20 pages

arXiv:1309.4132 [pdf, other]

Attribute-Efficient Evolvability of Linear Functions

Authors: Elaine Angelino, Varun Kanade

Abstract: In a seminal paper, Valiant (2006) introduced a computational model for evolution to address the question of complexity that can arise through Darwinian mechanisms. Valiant views evolution as a restricted form of computational learning, where the goal is to evolve a hypothesis that is close to the ideal function. Feldman (2008) showed that (correlational) statistical query learning algorithms could be framed as evolutionary mechanisms in Valiant's model. P. Valiant (2012) considered evolvability of real-valued functions and also showed that weak-optimization algorithms that use weak-evaluation oracles could be converted to evolutionary mechanisms. In this work, we focus on the complexity of representations of evolutionary mechanisms. In general, the reductions of Feldman and P. Valiant may result in intermediate representations that are arbitrarily complex (polynomial-sized circuits). We argue that biological constraints often dictate that the representations have low complexity, such as constant depth and fan-in circuits. We give mechanisms for evolving sparse linear functions under a large class of smooth distributions. These evolutionary algorithms are attribute-efficient in the sense that the size of the representations and the number of generations required depend only on the sparsity of the target function and the accuracy parameter, but have no dependence on the total number of attributes.

Submitted 3 April, 2014; v1 submitted 16 September, 2013; originally announced September 2013.

arXiv:1307.3617 [pdf, other]

MCMC Learning

Authors: Varun Kanade, Elchanan Mossel

Abstract: The theory of learning under the uniform distribution is rich and deep, with connections to cryptography, computational complexity, and the analysis of boolean functions to name a few areas. This theory however is very limited due to the fact that the uniform distribution and the corresponding Fourier basis are rarely encountered as a statistical model. A family of distributions that vastly generalizes the uniform distribution on the Boolean cube is that of distributions represented by Markov Random Fields (MRF). Markov Random Fields are one of the main tools for modeling high dimensional data in many areas of statistics and machine learning. In this paper we initiate the investigation of extending central ideas, methods and algorithms from the theory of learning under the uniform distribution to the setup of learning concepts given examples from MRF distributions. In particular, our results establish a novel connection between properties of MCMC sampling of MRFs and learning under the MRF distribution.

Submitted 12 June, 2015; v1 submitted 13 July, 2013; originally announced July 2013.

Comments: 28 pages, 1 figure

arXiv:1211.3212 [pdf, other]

Distributed Non-Stochastic Experts

Authors: Varun Kanade, Zhenming Liu, Bozidar Radunovic

Abstract: We consider the online distributed non-stochastic experts problem, where the distributed system consists of one coordinator node that is connected to $k$ sites, and the sites are required to communicate with each other via the coordinator. At each time-step $t$, one of the $k$ site nodes has to pick an expert from the set $\{1, \ldots, n\}$, and the same site receives information about payoffs of all experts for that round. The goal of the distributed system is to minimize regret at time horizon $T$, while simultaneously keeping communication to a minimum. The two extreme solutions to this problem are: (i) Full communication: This essentially simulates the non-distributed setting to obtain the optimal $O(\sqrt{\log(n)T})$ regret bound at the cost of $T$ communication. (ii) No communication: Each site runs an independent copy: the regret is $O(\sqrt{\log(n)kT})$ and the communication is 0. This paper shows the difficulty of simultaneously achieving regret asymptotically better than $\sqrt{kT}$ and communication better than $T$. We give a novel algorithm that for an oblivious adversary achieves a non-trivial trade-off: regret $O(\sqrt{k^{5(1+ε)/6} T})$ and communication $O(T/k^ε)$, for any value of $ε\in (0, 1/5)$. We also consider a variant of the model, where the coordinator picks the expert. In this model, we show that the label-efficient forecaster of Cesa-Bianchi et al. (2005) already gives us a strategy that is near optimal in the regret vs communication trade-off.

Submitted 14 November, 2012; originally announced November 2012.

arXiv:1211.0996 [pdf, ps, other]

Learning using Local Membership Queries

Authors: Pranjal Awasthi, Vitaly Feldman, Varun Kanade

Abstract: We introduce a new model of membership query (MQ) learning, where the learning algorithm is restricted to query points that are \emph{close} to random examples drawn from the underlying distribution. The learning model is intermediate between the PAC model (Valiant, 1984) and the PAC+MQ model (where the queries are allowed to be arbitrary points). Membership query algorithms are not popular among machine learning practitioners. Apart from the obvious difficulty of adaptively querying labelers, it has also been observed that querying \emph{unnatural} points leads to increased noise from human labelers (Lang and Baum, 1992). This motivates our study of learning algorithms that make queries that are close to examples generated from the data distribution. We restrict our attention to functions defined on the $n$-dimensional Boolean hypercube and say that a membership query is local if its Hamming distance from some example in the (random) training data is at most $O(\log(n))$. We show the following results in this model: (i) The class of sparse polynomials (with coefficients in R) over $\{0,1\}^n$ is polynomial time learnable under a large class of \emph{locally smooth} distributions using $O(\log(n))$-local queries. This class also includes the class of $O(\log(n))$-depth decision trees. (ii) The class of polynomial-sized decision trees is polynomial time learnable under product distributions using $O(\log(n))$-local queries. (iii) The class of polynomial size DNF formulas is learnable under the uniform distribution using $O(\log(n))$-local queries in time $n^{O(\log(\log(n)))}$. (iv) In addition we prove a number of results relating the proposed model to the traditional PAC model and the PAC+MQ model.

Submitted 17 April, 2013; v1 submitted 5 November, 2012; originally announced November 2012.

arXiv:1104.2018 [pdf, other]

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression

Authors: Sham Kakade, Adam Tauman Kalai, Varun Kanade, Ohad Shamir

Abstract: Generalized Linear Models (GLMs) and Single Index Models (SIMs) provide powerful generalizations of linear regression, where the target variable is assumed to be a (possibly unknown) 1-dimensional function of a linear predictor. In general, these problems entail non-convex estimation procedures, and, in practice, iterative local search heuristics are often used. Kalai and Sastry (2009) recently provided the first provably efficient method for learning SIMs and GLMs, under the assumptions that the data are in fact generated under a GLM and under certain monotonicity and Lipschitz constraints. However, to obtain provable performance, the method requires a fresh sample every iteration. In this paper, we provide algorithms for learning GLMs and SIMs, which are both computationally and statistically efficient. We also provide an empirical study, demonstrating their feasibility in practice.

Submitted 11 April, 2011; originally announced April 2011.

arXiv:1005.3566 [pdf, ps, other]

Evolution with Drifting Targets

Authors: Varun Kanade, Leslie G. Valiant, Jennifer Wortman Vaughan

Abstract: We consider the question of the stability of evolutionary algorithms to gradual changes, or drift, in the target concept. We define an algorithm to be resistant to drift if, for some inverse polynomial drift rate in the target function, it converges to accuracy 1 - ε, with polynomial resources, and then stays within that accuracy indefinitely, except with probability ε, at any one time. We show that every evolution algorithm, in the sense of Valiant (2007; 2009), can be converted using the Correlational Query technique of Feldman (2008), into such a drift resistant algorithm. For certain evolutionary algorithms, such as for Boolean conjunctions, we give bounds on the rates of drift that they can resist. We develop some new evolution algorithms that are resistant to significant drift. In particular, we give an algorithm for evolving linear separators over the spherically symmetric distribution that is resistant to a drift rate of O(ε/n), and another algorithm over the more general product normal distributions that resists a smaller drift rate. The above translation result can be also interpreted as one on the robustness of the notion of evolvability itself under changes of definition. As a second result in that direction we show that every evolution algorithm can be converted to a quasi-monotonic one that can evolve from any starting point without the performance ever dipping significantly below that of the starting point. This permits the somewhat unnatural feature of arbitrary performance degradations to be removed from several known robustness translations.

Submitted 19 May, 2010; originally announced May 2010.

Showing 1–39 of 39 results for author: Kanade, V