Search | arXiv e-print repository

Second-order Information Promotes Mini-Batch Robustness in Variance-Reduced Gradients

Authors: Sachin Garg, Albert S. Berahas, Michał Dereziński

Abstract: We show that, for finite-sum minimization problems, incorporating partial second-order information of the objective function can dramatically improve the robustness to mini-batch size of variance-reduced stochastic gradient methods, making them more scalable while retaining their benefits over traditional Newton-type approaches. We demonstrate this phenomenon on a prototypical stochastic second-or… ▽ More We show that, for finite-sum minimization problems, incorporating partial second-order information of the objective function can dramatically improve the robustness to mini-batch size of variance-reduced stochastic gradient methods, making them more scalable while retaining their benefits over traditional Newton-type approaches. We demonstrate this phenomenon on a prototypical stochastic second-order algorithm, called Mini-Batch Stochastic Variance-Reduced Newton ($\texttt{Mb-SVRN}$), which combines variance-reduced gradient estimates with access to an approximate Hessian oracle. In particular, we show that when the data size $n$ is sufficiently large, i.e., $n\gg α^2κ$, where $κ$ is the condition number and $α$ is the Hessian approximation factor, then $\texttt{Mb-SVRN}$ achieves a fast linear convergence rate that is independent of the gradient mini-batch size $b$, as long $b$ is in the range between $1$ and $b_{\max}=O(n/(α\log n))$. Only after increasing the mini-batch size past this critical point $b_{\max}$, the method begins to transition into a standard Newton-type algorithm which is much more sensitive to the Hessian approximation quality. We demonstrate this phenomenon empirically on benchmark optimization tasks showing that, after tuning the step size, the convergence rate of $\texttt{Mb-SVRN}$ remains fast for a wide range of mini-batch sizes, and the dependence of the phase transition point $b_{\max}$ on the Hessian approximation factor $α$ aligns with our theoretical predictions. △ Less

Submitted 23 April, 2024; originally announced April 2024.

MSC Class: 65K05; 90C06; 90C30

arXiv:2404.07815 [pdf, other]

Post-Hoc Reversal: Are We Selecting Models Prematurely?

Authors: Rishabh Ranjan, Saurabh Garg, Mrigank Raman, Carlos Guestrin, Zachary Chase Lipton

Abstract: Trained models are often composed with post-hoc transforms such as temperature scaling (TS), ensembling and stochastic weight averaging (SWA) to improve performance, robustness, uncertainty estimation, etc. However, such transforms are typically applied only after the base models have already been finalized by standard means. In this paper, we challenge this practice with an extensive empirical st… ▽ More Trained models are often composed with post-hoc transforms such as temperature scaling (TS), ensembling and stochastic weight averaging (SWA) to improve performance, robustness, uncertainty estimation, etc. However, such transforms are typically applied only after the base models have already been finalized by standard means. In this paper, we challenge this practice with an extensive empirical study. In particular, we demonstrate a phenomenon that we call post-hoc reversal, where performance trends are reversed after applying these post-hoc transforms. This phenomenon is especially prominent in high-noise settings. For example, while base models overfit badly early in training, both conventional ensembling and SWA favor base models trained for more epochs. Post-hoc reversal can also suppress the appearance of double descent and mitigate mismatches between test loss and test error seen in base models. Based on our findings, we propose post-hoc selection, a simple technique whereby post-hoc metrics inform model development decisions such as early stop**, checkpointing, and broader hyperparameter choices. Our experimental analyses span real-world vision, language, tabular and graph datasets from domains like satellite imaging, language modeling, census prediction and social network analysis. On an LLM instruction tuning dataset, post-hoc selection results in > 1.5x MMLU improvement compared to naive selection. Code is available at https://github.com/rishabh-ranjan/post-hoc-reversal. △ Less

Submitted 11 April, 2024; originally announced April 2024.

Comments: 9 pages + references + appendix, 7 figures

arXiv:2402.16926 [pdf, other]

On the (In)feasibility of ML Backdoor Detection as an Hypothesis Testing Problem

Authors: Georg Pichler, Marco Romanelli, Divya Prakash Manivannan, Prashanth Krishnamurthy, Farshad Khorrami, Siddharth Garg

Abstract: We introduce a formal statistical definition for the problem of backdoor detection in machine learning systems and use it to analyze the feasibility of such problems, providing evidence for the utility and applicability of our definition. The main contributions of this work are an impossibility result and an achievability result for backdoor detection. We show a no-free-lunch theorem, proving that… ▽ More We introduce a formal statistical definition for the problem of backdoor detection in machine learning systems and use it to analyze the feasibility of such problems, providing evidence for the utility and applicability of our definition. The main contributions of this work are an impossibility result and an achievability result for backdoor detection. We show a no-free-lunch theorem, proving that universal (adversary-unaware) backdoor detection is impossible, except for very small alphabet sizes. Thus, we argue, that backdoor detection methods need to be either explicitly, or implicitly adversary-aware. However, our work does not imply that backdoor detection cannot work in specific scenarios, as evidenced by successful backdoor detection methods in the scientific literature. Furthermore, we connect our definition to the probably approximately correct (PAC) learnability of the out-of-distribution detection problem. △ Less

Submitted 26 February, 2024; originally announced February 2024.

arXiv:2312.03318 [pdf, other]

Complementary Benefits of Contrastive Learning and Self-Training Under Distribution Shift

Authors: Saurabh Garg, Amrith Setlur, Zachary Chase Lipton, Sivaraman Balakrishnan, Virginia Smith, Aditi Raghunathan

Abstract: Self-training and contrastive learning have emerged as leading techniques for incorporating unlabeled data, both under distribution shift (unsupervised domain adaptation) and when it is absent (semi-supervised learning). However, despite the popularity and compatibility of these techniques, their efficacy in combination remains unexplored. In this paper, we undertake a systematic empirical investi… ▽ More Self-training and contrastive learning have emerged as leading techniques for incorporating unlabeled data, both under distribution shift (unsupervised domain adaptation) and when it is absent (semi-supervised learning). However, despite the popularity and compatibility of these techniques, their efficacy in combination remains unexplored. In this paper, we undertake a systematic empirical investigation of this combination, finding that (i) in domain adaptation settings, self-training and contrastive learning offer significant complementary gains; and (ii) in semi-supervised learning settings, surprisingly, the benefits are not synergistic. Across eight distribution shift datasets (e.g., BREEDs, WILDS), we demonstrate that the combined method obtains 3--8% higher accuracy than either approach independently. We then theoretically analyze these techniques in a simplified model of distribution shift, demonstrating scenarios under which the features produced by contrastive learning can yield a good initialization for self-training to further amplify gains and achieve optimal performance, even when either method alone would fail. △ Less

Submitted 6 December, 2023; originally announced December 2023.

Comments: NeurIPS 2023

arXiv:2311.11194 [pdf, other]

Testing with Non-identically Distributed Samples

Authors: Shivam Garg, Chirag Pabbaraju, Kirankumar Shiragur, Gregory Valiant

Abstract: We examine the extent to which sublinear-sample property testing and estimation applies to settings where samples are independently but not identically distributed. Specifically, we consider the following distributional property testing framework: Suppose there is a set of distributions over a discrete support of size $k$, $\textbf{p}_1, \textbf{p}_2,\ldots,\textbf{p}_T$, and we obtain $c$ indepen… ▽ More We examine the extent to which sublinear-sample property testing and estimation applies to settings where samples are independently but not identically distributed. Specifically, we consider the following distributional property testing framework: Suppose there is a set of distributions over a discrete support of size $k$, $\textbf{p}_1, \textbf{p}_2,\ldots,\textbf{p}_T$, and we obtain $c$ independent draws from each distribution. Suppose the goal is to learn or test a property of the average distribution, $\textbf{p}_{\mathrm{avg}}$. This setup models a number of important practical settings where the individual distributions correspond to heterogeneous entities -- either individuals, chronologically distinct time periods, spatially separated data sources, etc. From a learning standpoint, even with $c=1$ samples from each distribution, $Θ(k/\varepsilon^2)$ samples are necessary and sufficient to learn $\textbf{p}_{\mathrm{avg}}$ to within error $\varepsilon$ in TV distance. To test uniformity or identity -- distinguishing the case that $\textbf{p}_{\mathrm{avg}}$ is equal to some reference distribution, versus has $\ell_1$ distance at least $\varepsilon$ from the reference distribution, we show that a linear number of samples in $k$ is necessary given $c=1$ samples from each distribution. In contrast, for $c \ge 2$, we recover the usual sublinear sample testing of the i.i.d. setting: we show that $O(\sqrt{k}/\varepsilon^2 + 1/\varepsilon^4)$ samples are sufficient, matching the optimal sample complexity in the i.i.d. case in the regime where $\varepsilon \ge k^{-1/4}$. Additionally, we show that in the $c=2$ case, there is a constant $ρ> 0$ such that even in the linear regime with $ρk$ samples, no tester that considers the multiset of samples (ignoring which samples were drawn from the same $\textbf{p}_i$) can perform uniformity testing. △ Less

Submitted 18 November, 2023; originally announced November 2023.

arXiv:2307.08999 [pdf, ps, other]

Oracle Efficient Online Multicalibration and Omniprediction

Authors: Sumegha Garg, Christopher Jung, Omer Reingold, Aaron Roth

Abstract: A recent line of work has shown a surprising connection between multicalibration, a multi-group fairness notion, and omniprediction, a learning paradigm that provides simultaneous loss minimization guarantees for a large family of loss functions. Prior work studies omniprediction in the batch setting. We initiate the study of omniprediction in the online adversarial setting. Although there exist a… ▽ More A recent line of work has shown a surprising connection between multicalibration, a multi-group fairness notion, and omniprediction, a learning paradigm that provides simultaneous loss minimization guarantees for a large family of loss functions. Prior work studies omniprediction in the batch setting. We initiate the study of omniprediction in the online adversarial setting. Although there exist algorithms for obtaining notions of multicalibration in the online adversarial setting, unlike batch algorithms, they work only for small finite classes of benchmark functions $F$, because they require enumerating every function $f \in F$ at every round. In contrast, omniprediction is most interesting for learning theoretic hypothesis classes $F$, which are generally continuously large. We develop a new online multicalibration algorithm that is well defined for infinite benchmark classes $F$, and is oracle efficient (i.e. for any class $F$, the algorithm has the form of an efficient reduction to a no-regret learning algorithm for $F$). The result is the first efficient online omnipredictor -- an oracle efficient prediction algorithm that can be used to simultaneously obtain no regret guarantees to all Lipschitz convex loss functions. For the class $F$ of linear functions, we show how to make our algorithm efficient in the worst case. Also, we show upper and lower bounds on the extent to which our rates can be improved: our oracle efficient algorithm actually promises a stronger guarantee called swap-omniprediction, and we prove a lower bound showing that obtaining $O(\sqrt{T})$ bounds for swap-omniprediction is impossible in the online setting. On the other hand, we give a (non-oracle efficient) algorithm which can obtain the optimal $O(\sqrt{T})$ omniprediction bounds without going through multicalibration, giving an information theoretic separation between these two solution concepts. △ Less

Submitted 18 July, 2023; originally announced July 2023.

arXiv:2306.00312 [pdf, other]

(Almost) Provable Error Bounds Under Distribution Shift via Disagreement Discrepancy

Authors: Elan Rosenfeld, Saurabh Garg

Abstract: We derive an (almost) guaranteed upper bound on the error of deep neural networks under distribution shift using unlabeled test data. Prior methods either give bounds that are vacuous in practice or give estimates that are accurate on average but heavily underestimate error for a sizeable fraction of shifts. In particular, the latter only give guarantees based on complex continuous measures such a… ▽ More We derive an (almost) guaranteed upper bound on the error of deep neural networks under distribution shift using unlabeled test data. Prior methods either give bounds that are vacuous in practice or give estimates that are accurate on average but heavily underestimate error for a sizeable fraction of shifts. In particular, the latter only give guarantees based on complex continuous measures such as test calibration -- which cannot be identified without labels -- and are therefore unreliable. Instead, our bound requires a simple, intuitive condition which is well justified by prior empirical works and holds in practice effectively 100% of the time. The bound is inspired by $\mathcal{H}Δ\mathcal{H}$-divergence but is easier to evaluate and substantially tighter, consistently providing non-vacuous guarantees. Estimating the bound requires optimizing one multiclass classifier to disagree with another, for which some prior works have used sub-optimal proxy losses; we devise a "disagreement loss" which is theoretically justified and performs better in practice. We expect this loss can serve as a drop-in replacement for future methods which require maximizing multiclass disagreement. Across a wide range of benchmarks, our method gives valid error bounds while achieving average accuracy comparable to competitive estimation baselines. Code is publicly available at https://github.com/erosenfeld/disagree_discrep . △ Less

Submitted 31 May, 2023; originally announced June 2023.

arXiv:2305.19570 [pdf, other]

Online Label Shift: Optimal Dynamic Regret meets Practical Algorithms

Authors: Dheeraj Baby, Saurabh Garg, Tzu-Ching Yen, Sivaraman Balakrishnan, Zachary Chase Lipton, Yu-Xiang Wang

Abstract: This paper focuses on supervised and unsupervised online label shift, where the class marginals $Q(y)$ varies but the class-conditionals $Q(x|y)$ remain invariant. In the unsupervised setting, our goal is to adapt a learner, trained on some offline labeled data, to changing label distributions given unlabeled online data. In the supervised setting, we must both learn a classifier and adapt to the… ▽ More This paper focuses on supervised and unsupervised online label shift, where the class marginals $Q(y)$ varies but the class-conditionals $Q(x|y)$ remain invariant. In the unsupervised setting, our goal is to adapt a learner, trained on some offline labeled data, to changing label distributions given unlabeled online data. In the supervised setting, we must both learn a classifier and adapt to the dynamically evolving class marginals given only labeled online data. We develop novel algorithms that reduce the adaptation problem to online regression and guarantee optimal dynamic regret without any prior knowledge of the extent of drift in the label distribution. Our solution is based on bootstrap** the estimates of \emph{online regression oracles} that track the drifting proportions. Experiments across numerous simulated and real-world online label shift scenarios demonstrate the superior performance of our proposed approaches, often achieving 1-3\% improvement in accuracy while being sample and computationally efficient. Code is publicly available at https://github.com/acmi-lab/OnlineLabelShift. △ Less

Submitted 31 May, 2023; originally announced May 2023.

Comments: First three authors contributed equally

arXiv:2302.03020 [pdf, other]

RLSbench: Domain Adaptation Under Relaxed Label Shift

Authors: Saurabh Garg, Nick Erickson, James Sharpnack, Alex Smola, Sivaraman Balakrishnan, Zachary C. Lipton

Abstract: Despite the emergence of principled methods for domain adaptation under label shift, their sensitivity to shifts in class conditional distributions is precariously under explored. Meanwhile, popular deep domain adaptation heuristics tend to falter when faced with label proportions shifts. While several papers modify these heuristics in attempts to handle label proportions shifts, inconsistencies i… ▽ More Despite the emergence of principled methods for domain adaptation under label shift, their sensitivity to shifts in class conditional distributions is precariously under explored. Meanwhile, popular deep domain adaptation heuristics tend to falter when faced with label proportions shifts. While several papers modify these heuristics in attempts to handle label proportions shifts, inconsistencies in evaluation standards, datasets, and baselines make it difficult to gauge the current best practices. In this paper, we introduce RLSbench, a large-scale benchmark for relaxed label shift, consisting of $>$500 distribution shift pairs spanning vision, tabular, and language modalities, with varying label proportions. Unlike existing benchmarks, which primarily focus on shifts in class-conditional $p(x|y)$, our benchmark also focuses on label marginal shifts. First, we assess 13 popular domain adaptation methods, demonstrating more widespread failures under label proportion shifts than were previously known. Next, we develop an effective two-step meta-algorithm that is compatible with most domain adaptation heuristics: (i) pseudo-balance the data at each epoch; and (ii) adjust the final classifier with target label distribution estimate. The meta-algorithm improves existing domain adaptation heuristics under large label proportion shifts, often by 2--10\% accuracy points, while conferring minimal effect ($<$0.5\%) when label proportions do not shift. We hope that these findings and the availability of RLSbench will encourage researchers to rigorously evaluate proposed methods in relaxed label shift settings. Code is publicly available at https://github.com/acmi-lab/RLSbench. △ Less

Submitted 5 June, 2023; v1 submitted 6 February, 2023; originally announced February 2023.

Comments: Accepted at ICML 2023. Paper website: https://sites.google.com/view/rlsbench/

arXiv:2302.01051 [pdf, other]

Randomized prior wavelet neural operator for uncertainty quantification

Authors: Shailesh Garg, Souvik Chakraborty

Abstract: In this paper, we propose a novel data-driven operator learning framework referred to as the \textit{Randomized Prior Wavelet Neural Operator} (RP-WNO). The proposed RP-WNO is an extension of the recently proposed wavelet neural operator, which boasts excellent generalizing capabilities but cannot estimate the uncertainty associated with its predictions. RP-WNO, unlike the vanilla WNO, comes with… ▽ More In this paper, we propose a novel data-driven operator learning framework referred to as the \textit{Randomized Prior Wavelet Neural Operator} (RP-WNO). The proposed RP-WNO is an extension of the recently proposed wavelet neural operator, which boasts excellent generalizing capabilities but cannot estimate the uncertainty associated with its predictions. RP-WNO, unlike the vanilla WNO, comes with inherent uncertainty quantification module and hence, is expected to be extremely useful for scientists and engineers alike. RP-WNO utilizes randomized prior networks, which can account for prior information and is easier to implement for large, complex deep-learning architectures than its Bayesian counterpart. Four examples have been solved to test the proposed framework, and the results produced advocate favorably for the efficacy of the proposed framework. △ Less

Submitted 2 February, 2023; originally announced February 2023.

arXiv:2208.12926 [pdf, ps, other]

Overparameterization from Computational Constraints

Authors: Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody, Mingyuan Wang

Abstract: Overparameterized models with millions of parameters have been hugely successful. In this work, we ask: can the need for large models be, at least in part, due to the \emph{computational} limitations of the learner? Additionally, we ask, is this situation exacerbated for \emph{robust} learning? We show that this indeed could be the case. We show learning tasks for which computationally bounded lea… ▽ More Overparameterized models with millions of parameters have been hugely successful. In this work, we ask: can the need for large models be, at least in part, due to the \emph{computational} limitations of the learner? Additionally, we ask, is this situation exacerbated for \emph{robust} learning? We show that this indeed could be the case. We show learning tasks for which computationally bounded learners need \emph{significantly more} model parameters than what information-theoretic learners need. Furthermore, we show that even more model parameters could be necessary for robust learning. In particular, for computationally bounded learners, we extend the recent result of Bubeck and Sellke [NeurIPS'2021] which shows that robust models might need more parameters, to the computational regime and show that bounded learners could provably need an even larger number of parameters. Then, we address the following related question: can we hope to remedy the situation for robust computationally bounded learning by restricting \emph{adversaries} to also be computationally bounded for sake of obtaining models with fewer parameters? Here again, we show that this could be possible. Specifically, building on the work of Garg, Jha, Mahloujifar, and Mahmoody [ALT'2020], we demonstrate a learning task that can be learned efficiently and robustly against a computationally bounded attacker, while to be robust against an information-theoretic attacker requires the learner to utilize significantly more parameters. △ Less

Submitted 15 October, 2022; v1 submitted 27 August, 2022; originally announced August 2022.

arXiv:2207.13179 [pdf, other]

Unsupervised Learning under Latent Label Shift

Authors: Manley Roberts, Pranav Mani, Saurabh Garg, Zachary C. Lipton

Abstract: What sorts of structure might enable a learner to discover classes from unlabeled data? Traditional approaches rely on feature-space similarity and heroic assumptions on the data. In this paper, we introduce unsupervised learning under Latent Label Shift (LLS), where we have access to unlabeled data from multiple domains such that the label marginals $p_d(y)$ can shift across domains but the class… ▽ More What sorts of structure might enable a learner to discover classes from unlabeled data? Traditional approaches rely on feature-space similarity and heroic assumptions on the data. In this paper, we introduce unsupervised learning under Latent Label Shift (LLS), where we have access to unlabeled data from multiple domains such that the label marginals $p_d(y)$ can shift across domains but the class conditionals $p(\mathbf{x}|y)$ do not. This work instantiates a new principle for identifying classes: elements that shift together group together. For finite input spaces, we establish an isomorphism between LLS and topic modeling: inputs correspond to words, domains to documents, and labels to topics. Addressing continuous data, we prove that when each label's support contains a separable region, analogous to an anchor word, oracle access to $p(d|\mathbf{x})$ suffices to identify $p_d(y)$ and $p_d(y|\mathbf{x})$ up to permutation. Thus motivated, we introduce a practical algorithm that leverages domain-discriminative models as follows: (i) push examples through domain discriminator $p(d|\mathbf{x})$; (ii) discretize the data by clustering examples in $p(d|\mathbf{x})$ space; (iii) perform non-negative matrix factorization on the discrete data; (iv) combine the recovered $p(y|d)$ with the discriminator outputs $p(d|\mathbf{x})$ to compute $p_d(y|x) \; \forall d$. With semi-synthetic experiments, we show that our algorithm can leverage domain information to improve upon competitive unsupervised classification methods. We reveal a failure mode of standard unsupervised classification methods when feature-space similarity does not indicate true grou**s, and show empirically that our method better handles this case. Our results establish a deep connection between distribution shift and topic modeling, opening promising lines for future work. △ Less

Submitted 1 December, 2022; v1 submitted 26 July, 2022; originally announced July 2022.

Comments: NeurIPS 2022. Manley Roberts and Pranav Mani contributed equally to this work

arXiv:2206.05655 [pdf, other]

Variational Bayes Deep Operator Network: A data-driven Bayesian solver for parametric differential equations

Authors: Shailesh Garg, Souvik Chakraborty

Abstract: Neural network based data-driven operator learning schemes have shown tremendous potential in computational mechanics. DeepONet is one such neural network architecture which has gained widespread appreciation owing to its excellent prediction capabilities. Having said that, being set in a deterministic framework exposes DeepONet architecture to the risk of overfitting, poor generalization and in i… ▽ More Neural network based data-driven operator learning schemes have shown tremendous potential in computational mechanics. DeepONet is one such neural network architecture which has gained widespread appreciation owing to its excellent prediction capabilities. Having said that, being set in a deterministic framework exposes DeepONet architecture to the risk of overfitting, poor generalization and in its unaltered form, it is incapable of quantifying the uncertainties associated with its predictions. We propose in this paper, a Variational Bayes DeepONet (VB-DeepONet) for operator learning, which can alleviate these limitations of DeepONet architecture to a great extent and give user additional information regarding the associated uncertainty at the prediction stage. The key idea behind neural networks set in Bayesian framework is that, the weights and bias of the neural network are treated as probability distributions instead of point estimates and, Bayesian inference is used to update their prior distribution. Now, to manage the computational cost associated with approximating the posterior distribution, the proposed VB-DeepONet uses \textit{variational inference}. Unlike Markov Chain Monte Carlo schemes, variational inference has the capacity to take into account high dimensional posterior distributions while kee** the associated computational cost low. Different examples covering mechanics problems like diffusion reaction, gravity pendulum, advection diffusion have been shown to illustrate the performance of the proposed VB-DeepONet and comparisons have also been drawn against DeepONet set in deterministic framework. △ Less

Submitted 12 June, 2022; originally announced June 2022.

arXiv:2202.09931 [pdf, other]

Deconstructing Distributions: A Pointwise Framework of Learning

Authors: Gal Kaplun, Nikhil Ghosh, Saurabh Garg, Boaz Barak, Preetum Nakkiran

Abstract: In machine learning, we traditionally evaluate the performance of a single model, averaged over a collection of test inputs. In this work, we propose a new approach: we measure the performance of a collection of models when evaluated on a $\textit{single input point}$. Specifically, we study a point's $\textit{profile}$: the relationship between models' average performance on the test distribution… ▽ More In machine learning, we traditionally evaluate the performance of a single model, averaged over a collection of test inputs. In this work, we propose a new approach: we measure the performance of a collection of models when evaluated on a $\textit{single input point}$. Specifically, we study a point's $\textit{profile}$: the relationship between models' average performance on the test distribution and their pointwise performance on this individual point. We find that profiles can yield new insights into the structure of both models and data -- in and out-of-distribution. For example, we empirically show that real data distributions consist of points with qualitatively different profiles. On one hand, there are "compatible" points with strong correlation between the pointwise and average performance. On the other hand, there are points with weak and even $\textit{negative}$ correlation: cases where improving overall model accuracy actually $\textit{hurts}$ performance on these inputs. We prove that these experimental observations are inconsistent with the predictions of several simplified models of learning proposed in prior work. As an application, we use profiles to construct a dataset we call CIFAR-10-NEG: a subset of CINIC-10 such that for standard models, accuracy on CIFAR-10-NEG is $\textit{negatively correlated}$ with accuracy on CIFAR-10 test. This illustrates, for the first time, an OOD dataset that completely inverts "accuracy-on-the-line" (Miller, Taori, Raghunathan, Sagawa, Koh, Shankar, Liang, Carmon, and Schmidt 2021) △ Less

Submitted 7 June, 2022; v1 submitted 20 February, 2022; originally announced February 2022.

Comments: GK and NG contributed equally. v2: Added Figures 4, 5

arXiv:2201.13145 [pdf, other]

Assessment of DeepONet for reliability analysis of stochastic nonlinear dynamical systems

Authors: Shailesh Garg, Harshit Gupta, Souvik Chakraborty

Abstract: Time dependent reliability analysis and uncertainty quantification of structural system subjected to stochastic forcing function is a challenging endeavour as it necessitates considerable computational time. We investigate the efficacy of recently proposed DeepONet in solving time dependent reliability analysis and uncertainty quantification of systems subjected to stochastic loading. Unlike conve… ▽ More Time dependent reliability analysis and uncertainty quantification of structural system subjected to stochastic forcing function is a challenging endeavour as it necessitates considerable computational time. We investigate the efficacy of recently proposed DeepONet in solving time dependent reliability analysis and uncertainty quantification of systems subjected to stochastic loading. Unlike conventional machine learning and deep learning algorithms, DeepONet learns is a operator network and learns a function to function map** and hence, is ideally suited to propagate the uncertainty from the stochastic forcing function to the output responses. We use DeepONet to build a surrogate model for the dynamical system under consideration. Multiple case studies, involving both toy and benchmark problems, have been conducted to examine the efficacy of DeepONet in time dependent reliability analysis and uncertainty quantification of linear and nonlinear dynamical systems. Results obtained indicate that the DeepONet architecture is accurate as well as efficient. Moreover, DeepONet posses zero shot learning capabilities and hence, a trained model easily generalizes to unseen and new environment with no further training. △ Less

Submitted 31 January, 2022; originally announced January 2022.

Comments: 21 pages

arXiv:2201.04234 [pdf, other]

Leveraging Unlabeled Data to Predict Out-of-Distribution Performance

Authors: Saurabh Garg, Sivaraman Balakrishnan, Zachary C. Lipton, Behnam Neyshabur, Hanie Sedghi

Abstract: Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions that may cause performance drops. In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data. We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on… ▽ More Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions that may cause performance drops. In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data. We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples for which model confidence exceeds that threshold. ATC outperforms previous methods across several model architectures, types of distribution shifts (e.g., due to synthetic corruptions, dataset reproduction, or novel subpopulations), and datasets (Wilds, ImageNet, Breeds, CIFAR, and MNIST). In our experiments, ATC estimates target performance $2$-$4\times$ more accurately than prior methods. We also explore the theoretical foundations of the problem, proving that, in general, identifying the accuracy is just as hard as identifying the optimal predictor and thus, the efficacy of any method rests upon (perhaps unstated) assumptions on the nature of the shift. Finally, analyzing our method on some toy distributions, we provide insights concerning when it works. Code is available at https://github.com/saurabhgarg1996/ATC_code/. △ Less

Submitted 14 October, 2022; v1 submitted 11 January, 2022; originally announced January 2022.

Comments: Accepted at ICLR 2022

arXiv:2111.08706 [pdf, other]

How and When Random Feedback Works: A Case Study of Low-Rank Matrix Factorization

Authors: Shivam Garg, Santosh S. Vempala

Abstract: The success of gradient descent in ML and especially for learning neural networks is remarkable and robust. In the context of how the brain learns, one aspect of gradient descent that appears biologically difficult to realize (if not implausible) is that its updates rely on feedback from later layers to earlier layers through the same connections. Such bidirected links are relatively few in brain… ▽ More The success of gradient descent in ML and especially for learning neural networks is remarkable and robust. In the context of how the brain learns, one aspect of gradient descent that appears biologically difficult to realize (if not implausible) is that its updates rely on feedback from later layers to earlier layers through the same connections. Such bidirected links are relatively few in brain networks, and even when reciprocal connections exist, they may not be equi-weighted. Random Feedback Alignment (Lillicrap et al., 2016), where the backward weights are random and fixed, has been proposed as a bio-plausible alternative and found to be effective empirically. We investigate how and when feedback alignment (FA) works, focusing on one of the most basic problems with layered structure -- low-rank matrix factorization. In this problem, given a matrix $Y_{n\times m}$, the goal is to find a low rank factorization $Z_{n \times r}W_{r \times m}$ that minimizes the error $\|ZW-Y\|_F$. Gradient descent solves this problem optimally. We show that FA converges to the optimal solution when $r\ge \mbox{rank}(Y)$. We also shed light on how FA works. It is observed empirically that the forward weight matrices and (random) feedback matrices come closer during FA updates. Our analysis rigorously derives this phenomenon and shows how it facilitates convergence of FA*, a closely related variant of FA. We also show that FA can be far from optimal when $r < \mbox{rank}(Y)$. This is the first provable separation result between gradient descent and FA. Moreover, the representations found by gradient descent and FA can be almost orthogonal even when their error $\|ZW-Y\|_F$ is approximately equal. As a corollary, these results also hold for training two-layer linear neural networks when the training input is isotropic, and the output is a linear function of the input. △ Less

Submitted 10 April, 2022; v1 submitted 16 November, 2021; originally announced November 2021.

Comments: Fixed minor typos. AISTATS 2022

arXiv:2111.00980 [pdf, other]

Mixture Proportion Estimation and PU Learning: A Modern Approach

Authors: Saurabh Garg, Yifan Wu, Alex Smola, Sivaraman Balakrishnan, Zachary C. Lipton

Abstract: Given only positive examples and unlabeled examples (from both positive and negative classes), we might hope nevertheless to estimate an accurate positive-versus-negative classifier. Formally, this task is broken down into two subtasks: (i) Mixture Proportion Estimation (MPE) -- determining the fraction of positive examples in the unlabeled data; and (ii) PU-learning -- given such an estimate, lea… ▽ More Given only positive examples and unlabeled examples (from both positive and negative classes), we might hope nevertheless to estimate an accurate positive-versus-negative classifier. Formally, this task is broken down into two subtasks: (i) Mixture Proportion Estimation (MPE) -- determining the fraction of positive examples in the unlabeled data; and (ii) PU-learning -- given such an estimate, learning the desired positive-versus-negative classifier. Unfortunately, classical methods for both problems break down in high-dimensional settings. Meanwhile, recently proposed heuristics lack theoretical coherence and depend precariously on hyperparameter tuning. In this paper, we propose two simple techniques: Best Bin Estimation (BBE) (for MPE); and Conditional Value Ignoring Risk (CVIR), a simple objective for PU-learning. Both methods dominate previous approaches empirically, and for BBE, we establish formal guarantees that hold whenever we can train a model to cleanly separate out a small subset of positive examples. Our final algorithm (TED)$^n$, alternates between the two procedures, significantly improving both our mixture proportion estimator and classifier △ Less

Submitted 1 November, 2021; originally announced November 2021.

Comments: Spotlight at NeurIPS 2021

arXiv:2110.02667 [pdf, other]

Attentive Walk-Aggregating Graph Neural Networks

Authors: Mehmet F. Demirel, Shengchao Liu, Siddhant Garg, Zhenmei Shi, Yingyu Liang

Abstract: Graph neural networks (GNNs) have been shown to possess strong representation power, which can be exploited for downstream prediction tasks on graph-structured data, such as molecules and social networks. They typically learn representations by aggregating information from the $K$-hop neighborhood of individual vertices or from the enumerated walks in the graph. Prior studies have demonstrated the… ▽ More Graph neural networks (GNNs) have been shown to possess strong representation power, which can be exploited for downstream prediction tasks on graph-structured data, such as molecules and social networks. They typically learn representations by aggregating information from the $K$-hop neighborhood of individual vertices or from the enumerated walks in the graph. Prior studies have demonstrated the effectiveness of incorporating weighting schemes into GNNs; however, this has been primarily limited to $K$-hop neighborhood GNNs so far. In this paper, we aim to design an algorithm incorporating weighting schemes into walk-aggregating GNNs and analyze their effect. We propose a novel GNN model, called AWARE, that aggregates information about the walks in the graph using attention schemes. This leads to an end-to-end supervised learning method for graph-level prediction tasks in the standard setting where the input is the adjacency and vertex information of a graph, and the output is a predicted label for the graph. We then perform theoretical, empirical, and interpretability analyses of AWARE. Our theoretical analysis in a simplified setting identifies successful conditions for provable guarantees, demonstrating how the graph information is encoded in the representation, and how the weighting schemes in AWARE affect the representation and learning performance. Our experiments demonstrate the strong performance of AWARE in graph-level prediction tasks in the standard setting in the domains of molecular property prediction and social networks. Lastly, our interpretation study illustrates that AWARE can successfully capture the important substructures of the input graph. The code is available on $\href{https://github.com/mehmetfdemirel/aware}{GitHub}$. △ Less

Submitted 21 August, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

Comments: Published in TMLR (Transactions on Machine Learning Research) (08/2022) 32 pages

arXiv:2109.00538 [pdf, other]

doi 10.1016/j.ymssp.2022.109039

Physics-integrated hybrid framework for model form error identification in nonlinear dynamical systems

Authors: Shailesh Garg, Souvik Chakraborty, Budhaditya Hazra

Abstract: For real-life nonlinear systems, the exact form of nonlinearity is often not known and the known governing equations are often based on certain assumptions and approximations. Such representation introduced model-form error into the system. In this paper, we propose a novel gray-box modeling approach that not only identifies the model-form error but also utilizes it to improve the predictive capab… ▽ More For real-life nonlinear systems, the exact form of nonlinearity is often not known and the known governing equations are often based on certain assumptions and approximations. Such representation introduced model-form error into the system. In this paper, we propose a novel gray-box modeling approach that not only identifies the model-form error but also utilizes it to improve the predictive capability of the known but approximate governing equation. The primary idea is to treat the unknown model-form error as a residual force and estimate it using duel Bayesian filter based joint input-state estimation algorithms. For improving the predictive capability of the underlying physics, we first use machine learning algorithm to learn a map** between the estimated state and the input (model-form error) and then introduce it into the governing equation as an additional term. This helps in improving the predictive capability of the governing physics and allows the model to generalize to unseen environment. Although in theory, any machine learning algorithm can be used within the proposed framework, we use Gaussian process in this work. To test the performance of proposed framework, case studies discussing four different dynamical systems are discussed; results for which indicate that the framework is applicable to a wide variety of systems and can produce reliable estimates of original system's states. △ Less

Submitted 1 September, 2021; originally announced September 2021.

Comments: 23 pages

arXiv:2108.05828 [pdf, other]

A general class of surrogate functions for stable and efficient reinforcement learning

Authors: Sharan Vaswani, Olivier Bachem, Simone Totaro, Robert Mueller, Shivam Garg, Matthieu Geist, Marlos C. Machado, Pablo Samuel Castro, Nicolas Le Roux

Abstract: Common policy gradient methods rely on the maximization of a sequence of surrogate functions. In recent years, many such surrogate functions have been proposed, most without strong theoretical guarantees, leading to algorithms such as TRPO, PPO or MPO. Rather than design yet another surrogate function, we instead propose a general framework (FMA-PG) based on functional mirror ascent that gives ris… ▽ More Common policy gradient methods rely on the maximization of a sequence of surrogate functions. In recent years, many such surrogate functions have been proposed, most without strong theoretical guarantees, leading to algorithms such as TRPO, PPO or MPO. Rather than design yet another surrogate function, we instead propose a general framework (FMA-PG) based on functional mirror ascent that gives rise to an entire family of surrogate functions. We construct surrogate functions that enable policy improvement guarantees, a property not shared by most existing surrogate functions. Crucially, these guarantees hold regardless of the choice of policy parameterization. Moreover, a particular instantiation of FMA-PG recovers important implementation heuristics (e.g., using forward vs reverse KL divergence) resulting in a variant of TRPO with additional desirable properties. Via experiments on simple bandit problems, we evaluate the algorithms instantiated by FMA-PG. The proposed framework also suggests an improved variant of PPO, whose robustness and efficiency we empirically demonstrate on the MuJoCo suite. △ Less

Submitted 30 October, 2023; v1 submitted 12 August, 2021; originally announced August 2021.

Comments: Fixed minor typos

arXiv:2105.00303 [pdf, other]

RATT: Leveraging Unlabeled Data to Guarantee Generalization

Authors: Saurabh Garg, Sivaraman Balakrishnan, J. Zico Kolter, Zachary C. Lipton

Abstract: To assess generalization, machine learning scientists typically either (i) bound the generalization gap and then (after training) plug in the empirical risk to obtain a bound on the true risk; or (ii) validate empirically on holdout data. However, (i) typically yields vacuous guarantees for overparameterized models. Furthermore, (ii) shrinks the training set and its guarantee erodes with each re-u… ▽ More To assess generalization, machine learning scientists typically either (i) bound the generalization gap and then (after training) plug in the empirical risk to obtain a bound on the true risk; or (ii) validate empirically on holdout data. However, (i) typically yields vacuous guarantees for overparameterized models. Furthermore, (ii) shrinks the training set and its guarantee erodes with each re-use of the holdout set. In this paper, we introduce a method that leverages unlabeled data to produce generalization bounds. After augmenting our (labeled) training set with randomly labeled fresh examples, we train in the standard fashion. Whenever classifiers achieve low error on clean data and high error on noisy data, our bound provides a tight upper bound on the true risk. We prove that our bound is valid for 0-1 empirical risk minimization and with linear classifiers trained by gradient descent. Our approach is especially useful in conjunction with deep learning due to the early learning phenomenon whereby networks fit true labels before noisy labels but requires one intuitive assumption. Empirically, on canonical computer vision and NLP tasks, our bound provides non-vacuous generalization guarantees that track actual performance closely. This work provides practitioners with an option for certifying the generalization of deep nets even when unseen labeled data is unavailable and provides theoretical insights into the relationship between random label noise and generalization. △ Less

Submitted 6 November, 2021; v1 submitted 1 May, 2021; originally announced May 2021.

Comments: ICML 2021 (Long Talk)

arXiv:2103.15636 [pdf, other]

Machine learning based digital twin for stochastic nonlinear multi-degree of freedom dynamical system

Authors: Shailesh Garg, Ankush Gogoi, Souvik Chakraborty, Budhaditya Hazra

Abstract: The potential of digital twin technology is immense, specifically in the infrastructure, aerospace, and automotive sector. However, practical implementation of this technology is not at an expected speed, specifically because of lack of application-specific details. In this paper, we propose a novel digital twin framework for stochastic nonlinear multi-degree of freedom (MDOF) dynamical systems. T… ▽ More The potential of digital twin technology is immense, specifically in the infrastructure, aerospace, and automotive sector. However, practical implementation of this technology is not at an expected speed, specifically because of lack of application-specific details. In this paper, we propose a novel digital twin framework for stochastic nonlinear multi-degree of freedom (MDOF) dynamical systems. The approach proposed in this paper strategically decouples the problem into two time-scales -- (a) a fast time-scale governing the system dynamics and (b) a slow time-scale governing the degradation in the system. The proposed digital twin has four components - (a) a physics-based nominal model (low-fidelity), (b) a Bayesian filtering algorithm a (c) a supervised machine learning algorithm and (d) a high-fidelity model for predicting future responses. The physics-based nominal model combined with Bayesian filtering is used combined parameter state estimation and the supervised machine learning algorithm is used for learning the temporal evolution of the parameters. While the proposed framework can be used with any choice of Bayesian filtering and machine learning algorithm, we propose to use unscented Kalman filter and Gaussian process. Performance of the proposed approach is illustrated using two examples. Results obtained indicate the applicability and excellent performance of the proposed digital twin framework. △ Less

Submitted 29 March, 2021; originally announced March 2021.

Comments: 21 pages

arXiv:2102.10264 [pdf, other]

On Proximal Policy Optimization's Heavy-tailed Gradients

Authors: Saurabh Garg, Joshua Zhanson, Emilio Parisotto, Adarsh Prasad, J. Zico Kolter, Zachary C. Lipton, Sivaraman Balakrishnan, Ruslan Salakhutdinov, Pradeep Ravikumar

Abstract: Modern policy gradient algorithms such as Proximal Policy Optimization (PPO) rely on an arsenal of heuristics, including loss clip** and gradient clip**, to ensure successful learning. These heuristics are reminiscent of techniques from robust statistics, commonly used for estimation in outlier-rich (``heavy-tailed'') regimes. In this paper, we present a detailed empirical study to characteriz… ▽ More Modern policy gradient algorithms such as Proximal Policy Optimization (PPO) rely on an arsenal of heuristics, including loss clip** and gradient clip**, to ensure successful learning. These heuristics are reminiscent of techniques from robust statistics, commonly used for estimation in outlier-rich (``heavy-tailed'') regimes. In this paper, we present a detailed empirical study to characterize the heavy-tailed nature of the gradients of the PPO surrogate reward function. We demonstrate that the gradients, especially for the actor network, exhibit pronounced heavy-tailedness and that it increases as the agent's policy diverges from the behavioral policy (i.e., as the agent goes further off policy). Further examination implicates the likelihood ratios and advantages in the surrogate reward as the main sources of the observed heavy-tailedness. We then highlight issues arising due to the heavy-tailed nature of the gradients. In this light, we study the effects of the standard PPO clip** heuristics, demonstrating that these tricks primarily serve to offset heavy-tailedness in gradients. Thus motivated, we propose incorporating GMOM, a high-dimensional robust estimator, into PPO as a substitute for three clip** tricks. Despite requiring less hyperparameter tuning, our method matches the performance of PPO (with all heuristics enabled) on a battery of MuJoCo continuous control tasks. △ Less

Submitted 12 July, 2021; v1 submitted 20 February, 2021; originally announced February 2021.

Comments: ICML 2021

arXiv:2009.09283 [pdf, other]

Subverting Privacy-Preserving GANs: Hiding Secrets in Sanitized Images

Authors: Kang Liu, Benjamin Tan, Siddharth Garg

Abstract: Unprecedented data collection and sharing have exacerbated privacy concerns and led to increasing interest in privacy-preserving tools that remove sensitive attributes from images while maintaining useful information for other tasks. Currently, state-of-the-art approaches use privacy-preserving generative adversarial networks (PP-GANs) for this purpose, for instance, to enable reliable facial expr… ▽ More Unprecedented data collection and sharing have exacerbated privacy concerns and led to increasing interest in privacy-preserving tools that remove sensitive attributes from images while maintaining useful information for other tasks. Currently, state-of-the-art approaches use privacy-preserving generative adversarial networks (PP-GANs) for this purpose, for instance, to enable reliable facial expression recognition without leaking users' identity. However, PP-GANs do not offer formal proofs of privacy and instead rely on experimentally measuring information leakage using classification accuracy on the sensitive attributes of deep learning (DL)-based discriminators. In this work, we question the rigor of such checks by subverting existing privacy-preserving GANs for facial expression recognition. We show that it is possible to hide the sensitive identification data in the sanitized output images of such PP-GANs for later extraction, which can even allow for reconstruction of the entire input images, while satisfying privacy checks. We demonstrate our approach via a PP-GAN-based architecture and provide qualitative and quantitative evaluations using two public datasets. Our experimental results raise fundamental questions about the need for more rigorous privacy checks of PP-GANs, and we provide insights into the social impact of these. △ Less

Submitted 19 September, 2020; originally announced September 2020.

arXiv:2008.12338 [pdf, other]

Adversarially Robust Learning via Entropic Regularization

Authors: Gauri Jagatap, Ameya Joshi, Animesh Basak Chowdhury, Siddharth Garg, Chinmay Hegde

Abstract: In this paper we propose a new family of algorithms, ATENT, for training adversarially robust deep neural networks. We formulate a new loss function that is equipped with an additional entropic regularization. Our loss function considers the contribution of adversarial samples that are drawn from a specially designed distribution in the data space that assigns high probability to points with high… ▽ More In this paper we propose a new family of algorithms, ATENT, for training adversarially robust deep neural networks. We formulate a new loss function that is equipped with an additional entropic regularization. Our loss function considers the contribution of adversarial samples that are drawn from a specially designed distribution in the data space that assigns high probability to points with high loss and in the immediate neighborhood of training samples. Our proposed algorithms optimize this loss to seek adversarially robust valleys of the loss landscape. Our approach achieves competitive (or better) performance in terms of robust classification accuracy as compared to several state-of-the-art robust learning approaches on benchmark datasets such as MNIST and CIFAR-10. △ Less

Submitted 19 February, 2021; v1 submitted 27 August, 2020; originally announced August 2020.

arXiv:2008.02447 [pdf, other]

Functional Regularization for Representation Learning: A Unified Theoretical Perspective

Authors: Siddhant Garg, Yingyu Liang

Abstract: Unsupervised and self-supervised learning approaches have become a crucial tool to learn representations for downstream prediction tasks. While these approaches are widely used in practice and achieve impressive empirical gains, their theoretical understanding largely lags behind. Towards bridging this gap, we present a unifying perspective where several such approaches can be viewed as imposing a… ▽ More Unsupervised and self-supervised learning approaches have become a crucial tool to learn representations for downstream prediction tasks. While these approaches are widely used in practice and achieve impressive empirical gains, their theoretical understanding largely lags behind. Towards bridging this gap, we present a unifying perspective where several such approaches can be viewed as imposing a regularization on the representation via a learnable function using unlabeled data. We propose a discriminative theoretical framework for analyzing the sample complexity of these approaches, which generalizes the framework of (Balcan and Blum, 2010) to allow learnable regularization functions. Our sample complexity bounds show that, with carefully chosen hypothesis classes to exploit the structure in the data, these learnable regularization functions can prune the hypothesis space, and help reduce the amount of labeled data needed. We then provide two concrete examples of functional regularization, one using auto-encoders and the other using masked self-supervision, and apply our framework to quantify the reduction in the sample complexity bound of labeled data. We also provide complementary empirical results to support our analysis. △ Less

Submitted 21 October, 2020; v1 submitted 6 August, 2020; originally announced August 2020.

Comments: Accepted at NeurIPS 2020

arXiv:2008.01761 [pdf, other]

doi 10.1145/3340531.3412130

Can Adversarial Weight Perturbations Inject Neural Backdoors?

Authors: Siddhant Garg, Adarsh Kumar, Vibhor Goel, Yingyu Liang

Abstract: Adversarial machine learning has exposed several security hazards of neural models and has become an important research topic in recent times. Thus far, the concept of an "adversarial perturbation" has exclusively been used with reference to the input space referring to a small, imperceptible change which can cause a ML model to err. In this work we extend the idea of "adversarial perturbations" t… ▽ More Adversarial machine learning has exposed several security hazards of neural models and has become an important research topic in recent times. Thus far, the concept of an "adversarial perturbation" has exclusively been used with reference to the input space referring to a small, imperceptible change which can cause a ML model to err. In this work we extend the idea of "adversarial perturbations" to the space of model weights, specifically to inject backdoors in trained DNNs, which exposes a security risk of using publicly available trained models. Here, injecting a backdoor refers to obtaining a desired outcome from the model when a trigger pattern is added to the input, while retaining the original model predictions on a non-triggered input. From the perspective of an adversary, we characterize these adversarial perturbations to be constrained within an $\ell_{\infty}$ norm around the original model weights. We introduce adversarial perturbations in the model weights using a composite loss on the predictions of the original model and the desired trigger through projected gradient descent. We empirically show that these adversarial weight perturbations exist universally across several computer vision and natural language processing tasks. Our results show that backdoors can be successfully injected with a very small average relative change in model weight values for several applications. △ Less

Submitted 21 September, 2020; v1 submitted 4 August, 2020; originally announced August 2020.

Comments: Accepted as a conference paper at CIKM 2020

arXiv:2007.09712 [pdf, other]

doi 10.1109/JIOT.2020.3011726

Deep Anomaly Detection for Time-series Data in Industrial IoT: A Communication-Efficient On-device Federated Learning Approach

Authors: Yi Liu, Sahil Garg, Jiangtian Nie, Yang Zhang, Zehui Xiong, Jiawen Kang, M. Shamim Hossain

Abstract: Since edge device failures (i.e., anomalies) seriously affect the production of industrial products in Industrial IoT (IIoT), accurately and timely detecting anomalies is becoming increasingly important. Furthermore, data collected by the edge device may contain the user's private data, which is challenging the current detection approaches as user privacy is calling for the public concern in recen… ▽ More Since edge device failures (i.e., anomalies) seriously affect the production of industrial products in Industrial IoT (IIoT), accurately and timely detecting anomalies is becoming increasingly important. Furthermore, data collected by the edge device may contain the user's private data, which is challenging the current detection approaches as user privacy is calling for the public concern in recent years. With this focus, this paper proposes a new communication-efficient on-device federated learning (FL)-based deep anomaly detection framework for sensing time-series data in IIoT. Specifically, we first introduce a FL framework to enable decentralized edge devices to collaboratively train an anomaly detection model, which can improve its generalization ability. Second, we propose an Attention Mechanism-based Convolutional Neural Network-Long Short Term Memory (AMCNN-LSTM) model to accurately detect anomalies. The AMCNN-LSTM model uses attention mechanism-based CNN units to capture important fine-grained features, thereby preventing memory loss and gradient dispersion problems. Furthermore, this model retains the advantages of LSTM unit in predicting time series data. Third, to adapt the proposed framework to the timeliness of industrial anomaly detection, we propose a gradient compression mechanism based on Top-\textit{k} selection to improve communication efficiency. Extensive experiment studies on four real-world datasets demonstrate that the proposed framework can accurately and timely detect anomalies and also reduce the communication overhead by 50\% compared to the federated learning framework that does not use a gradient compression scheme. △ Less

Submitted 19 July, 2020; originally announced July 2020.

Comments: IEEE Internet of Things Journal

arXiv:2007.00611 [pdf, other]

Gradient Temporal-Difference Learning with Regularized Corrections

Authors: Sina Ghiassian, Andrew Patterson, Shivam Garg, Dhawal Gupta, Adam White, Martha White

Abstract: It is still common to use Q-learning and temporal difference (TD) learning-even though they have divergence issues and sound Gradient TD alternatives exist-because divergence seems rare and they typically perform well. However, recent work with large neural network learning systems reveals that instability is more common than previously thought. Practitioners face a difficult dilemma: choose an ea… ▽ More It is still common to use Q-learning and temporal difference (TD) learning-even though they have divergence issues and sound Gradient TD alternatives exist-because divergence seems rare and they typically perform well. However, recent work with large neural network learning systems reveals that instability is more common than previously thought. Practitioners face a difficult dilemma: choose an easy to use and performant TD method, or a more complex algorithm that is more sound but harder to tune and all but unexplored with non-linear function approximation or control. In this paper, we introduce a new method called TD with Regularized Corrections (TDRC), that attempts to balance ease of use, soundness, and performance. It behaves as well as TD, when TD performs well, but is sound in cases where TD diverges. We empirically investigate TDRC across a range of problems, for both prediction and control, and for both linear and non-linear function approximation, and show, potentially for the first time, that gradient TD methods could be a better alternative to TD and Q-learning. △ Less

Submitted 17 September, 2020; v1 submitted 1 July, 2020; originally announced July 2020.

Comments: Appeared in Proceedings of the 37th International Conference on Machine Learning (ICML2020)

arXiv:2006.08733 [pdf, other]

CryptoNAS: Private Inference on a ReLU Budget

Authors: Zahra Ghodsi, Akshaj Veldanda, Brandon Reagen, Siddharth Garg

Abstract: Machine learning as a service has given raise to privacy concerns surrounding clients' data and providers' models and has catalyzed research in private inference (PI): methods to process inferences without disclosing inputs. Recently, researchers have adapted cryptographic techniques to show PI is possible, however all solutions increase inference latency beyond practical limits. This paper makes… ▽ More Machine learning as a service has given raise to privacy concerns surrounding clients' data and providers' models and has catalyzed research in private inference (PI): methods to process inferences without disclosing inputs. Recently, researchers have adapted cryptographic techniques to show PI is possible, however all solutions increase inference latency beyond practical limits. This paper makes the observation that existing models are ill-suited for PI and proposes a novel NAS method, named CryptoNAS, for finding and tailoring models to the needs of PI. The key insight is that in PI operator latency cost are non-linear operations (e.g., ReLU) dominate latency, while linear layers become effectively free. We develop the idea of a ReLU budget as a proxy for inference latency and use CryptoNAS to build models that maximize accuracy within a given budget. CryptoNAS improves accuracy by 3.4% and latency by 2.4x over the state-of-the-art. △ Less

Submitted 13 May, 2021; v1 submitted 15 June, 2020; originally announced June 2020.

arXiv:2005.08377 [pdf, ps, other]

The Role of Randomness and Noise in Strategic Classification

Authors: Mark Braverman, Sumegha Garg

Abstract: We investigate the problem of designing optimal classifiers in the strategic classification setting, where the classification is part of a game in which players can modify their features to attain a favorable classification outcome (while incurring some cost). Previously, the problem has been considered from a learning-theoretic perspective and from the algorithmic fairness perspective. Our main c… ▽ More We investigate the problem of designing optimal classifiers in the strategic classification setting, where the classification is part of a game in which players can modify their features to attain a favorable classification outcome (while incurring some cost). Previously, the problem has been considered from a learning-theoretic perspective and from the algorithmic fairness perspective. Our main contributions include 1. Showing that if the objective is to maximize the efficiency of the classification process (defined as the accuracy of the outcome minus the sunk cost of the qualified players manipulating their features to gain a better outcome), then using randomized classifiers (that is, ones where the probability of a given feature vector to be accepted by the classifier is strictly between 0 and 1) is necessary. 2. Showing that in many natural cases, the imposed optimal solution (in terms of efficiency) has the structure where players never change their feature vectors (the randomized classifier is structured in a way, such that the gain in the probability of being classified as a 1 does not justify the expense of changing one's features). 3. Observing that the randomized classification is not a stable best-response from the classifier's viewpoint, and that the classifier doesn't benefit from randomized classifiers without creating instability in the system. 4. Showing that in some cases, a noisier signal leads to better equilibria outcomes -- improving both accuracy and fairness when more than one subpopulation with different feature adjustment costs are involved. This is interesting from a policy perspective, since it is hard to force institutions to stick to a particular randomized classification strategy (especially in a context of a market with multiple classifiers), but it is possible to alter the information environment to make the feature signals inherently noisier. △ Less

Submitted 17 May, 2020; originally announced May 2020.

Comments: 22 pages. Appeared in FORC, 2020

arXiv:2004.12492 [pdf, other]

Bias Busters: Robustifying DL-based Lithographic Hotspot Detectors Against Backdooring Attacks

Authors: Kang Liu, Benjamin Tan, Gaurav Rajavendra Reddy, Siddharth Garg, Yiorgos Makris, Ramesh Karri

Abstract: Deep learning (DL) offers potential improvements throughout the CAD tool-flow, one promising application being lithographic hotspot detection. However, DL techniques have been shown to be especially vulnerable to inference and training time adversarial attacks. Recent work has demonstrated that a small fraction of malicious physical designers can stealthily "backdoor" a DL-based hotspot detector d… ▽ More Deep learning (DL) offers potential improvements throughout the CAD tool-flow, one promising application being lithographic hotspot detection. However, DL techniques have been shown to be especially vulnerable to inference and training time adversarial attacks. Recent work has demonstrated that a small fraction of malicious physical designers can stealthily "backdoor" a DL-based hotspot detector during its training phase such that it accurately classifies regular layout clips but predicts hotspots containing a specially crafted trigger shape as non-hotspots. We propose a novel training data augmentation strategy as a powerful defense against such backdooring attacks. The defense works by eliminating the intentional biases introduced in the training data but does not require knowledge of which training samples are poisoned or the nature of the backdoor trigger. Our results show that the defense can drastically reduce the attack success rate from 84% to ~0%. △ Less

Submitted 26 April, 2020; originally announced April 2020.

arXiv:2004.09900 [pdf, other]

An RNN-Survival Model to Decide Email Send Times

Authors: Harvineet Singh, Moumita Sinha, Atanu R. Sinha, Sahil Garg, Neha Banerjee

Abstract: Email communications are ubiquitous. Firms control send times of emails and thereby the instants at which emails reach recipients (it is assumed email is received instantaneously from the send time). However, they do not control the duration it takes for recipients to open emails, labeled as time-to-open. Importantly, among emails that are opened, most occur within a short window from their send t… ▽ More Email communications are ubiquitous. Firms control send times of emails and thereby the instants at which emails reach recipients (it is assumed email is received instantaneously from the send time). However, they do not control the duration it takes for recipients to open emails, labeled as time-to-open. Importantly, among emails that are opened, most occur within a short window from their send times. We posit that emails are likely to be opened sooner when send times are convenient for recipients, while for other send times, emails can get ignored. Thus, to compute appropriate send times it is important to predict times-to-open accurately. We propose a recurrent neural network (RNN) in a survival model framework to predict times-to-open, for each recipient. Using that we compute appropriate send times. We experiment on a data set of emails sent to a million customers over five months. The sequence of emails received by a person from a sender is a result of interactions with past emails from the sender, and hence contain useful signal that inform our model. This sequential dependence affords our proposed RNN-Survival (RNN-S) approach to outperform survival analysis approaches in predicting times-to-open. We show that best times to send emails can be computed accurately from predicted times-to-open. This approach allows a firm to tune send times of emails, which is in its control, to favorably influence open rates and engagement. △ Less

Submitted 21 April, 2020; originally announced April 2020.

Comments: 11 pages, 3 figures, 2 tables

arXiv:2003.12020 [pdf, ps, other]

A Separation Result Between Data-oblivious and Data-aware Poisoning Attacks

Authors: Samuel Deng, Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody, Abhradeep Thakurta

Abstract: Poisoning attacks have emerged as a significant security threat to machine learning algorithms. It has been demonstrated that adversaries who make small changes to the training set, such as adding specially crafted data points, can hurt the performance of the output model. Some of the stronger poisoning attacks require the full knowledge of the training data. This leaves open the possibility of ac… ▽ More Poisoning attacks have emerged as a significant security threat to machine learning algorithms. It has been demonstrated that adversaries who make small changes to the training set, such as adding specially crafted data points, can hurt the performance of the output model. Some of the stronger poisoning attacks require the full knowledge of the training data. This leaves open the possibility of achieving the same attack results using poisoning attacks that do not have the full knowledge of the clean training set. In this work, we initiate a theoretical study of the problem above. Specifically, for the case of feature selection with LASSO, we show that full-information adversaries (that craft poisoning examples based on the rest of the training data) are provably stronger than the optimal attacker that is oblivious to the training set yet has access to the distribution of the data. Our separation result shows that the two setting of data-aware and data-oblivious are fundamentally different and we cannot hope to always achieve the same attack or defense results in these scenarios. △ Less

Submitted 13 December, 2021; v1 submitted 26 March, 2020; originally announced March 2020.

arXiv:2003.07554 [pdf, other]

A Unified View of Label Shift Estimation

Authors: Saurabh Garg, Yifan Wu, Sivaraman Balakrishnan, Zachary C. Lipton

Abstract: Under label shift, the label distribution p(y) might change but the class-conditional distributions p(x|y) do not. There are two dominant approaches for estimating the label marginal. BBSE, a moment-matching approach based on confusion matrices, is provably consistent and provides interpretable error bounds. However, a maximum likelihood estimation approach, which we call MLLS, dominates empirical… ▽ More Under label shift, the label distribution p(y) might change but the class-conditional distributions p(x|y) do not. There are two dominant approaches for estimating the label marginal. BBSE, a moment-matching approach based on confusion matrices, is provably consistent and provides interpretable error bounds. However, a maximum likelihood estimation approach, which we call MLLS, dominates empirically. In this paper, we present a unified view of the two methods and the first theoretical characterization of MLLS. Our contributions include (i) consistency conditions for MLLS, which include calibration of the classifier and a confusion matrix invertibility condition that BBSE also requires; (ii) a unified framework, casting BBSE as roughly equivalent to MLLS for a particular choice of calibration method; and (iii) a decomposition of MLLS's finite-sample error into terms reflecting miscalibration and estimation error. Our analysis attributes BBSE's statistical inefficiency to a loss of information due to coarse calibration. Experiments on synthetic data, MNIST, and CIFAR10 support our findings. △ Less

Submitted 16 October, 2020; v1 submitted 17 March, 2020; originally announced March 2020.

Comments: Accepted at Neurips 2020

arXiv:2003.03919 [pdf, other]

doi 10.24963/ijcai.2020/386

Temporal Attribute Prediction via Joint Modeling of Multi-Relational Structure Evolution

Authors: Sankalp Garg, Navodita Sharma, Woojeong **, Xiang Ren

Abstract: Time series prediction is an important problem in machine learning. Previous methods for time series prediction did not involve additional information. With a lot of dynamic knowledge graphs available, we can use this additional information to predict the time series better. Recently, there has been a focus on the application of deep representation learning on dynamic graphs. These methods predict… ▽ More Time series prediction is an important problem in machine learning. Previous methods for time series prediction did not involve additional information. With a lot of dynamic knowledge graphs available, we can use this additional information to predict the time series better. Recently, there has been a focus on the application of deep representation learning on dynamic graphs. These methods predict the structure of the graph by reasoning over the interactions in the graph at previous time steps. In this paper, we propose a new framework to incorporate the information from dynamic knowledge graphs for time series prediction. We show that if the information contained in the graph and the time series data are closely related, then this inter-dependence can be used to predict the time series with improved accuracy. Our framework, DArtNet, learns a static embedding for every node in the graph as well as a dynamic embedding which is dependent on the dynamic attribute value (time-series). Then it captures the information from the neighborhood by taking a relation specific mean and encodes the history information using RNN. We jointly train the model link prediction and attribute prediction. We evaluate our method on five specially curated datasets for this problem and show a consistent improvement in time series prediction results. We release the data and code of model DArtNet for future research at https://github.com/INK-USC/DArtNet . △ Less

Submitted 13 July, 2020; v1 submitted 9 March, 2020; originally announced March 2020.

Comments: In Proceedings of IJCAI 2020. Code can be found at https://github.com/INK-USC/DArtNet . The sole copyright holder is IJCAI (International Joint Conferences on Artificial Intelligence), all rights reserved. Original Publication available at https://www.ijcai.org/Proceedings/2020/386

arXiv:2002.07375 [pdf, other]

Symbolic Network: Generalized Neural Policies for Relational MDPs

Authors: Sankalp Garg, Aniket Bajpai, Mausam

Abstract: A Relational Markov Decision Process (RMDP) is a first-order representation to express all instances of a single probabilistic planning domain with possibly unbounded number of objects. Early work in RMDPs outputs generalized (instance-independent) first-order policies or value functions as a means to solve all instances of a domain at once. Unfortunately, this line of work met with limited succes… ▽ More A Relational Markov Decision Process (RMDP) is a first-order representation to express all instances of a single probabilistic planning domain with possibly unbounded number of objects. Early work in RMDPs outputs generalized (instance-independent) first-order policies or value functions as a means to solve all instances of a domain at once. Unfortunately, this line of work met with limited success due to inherent limitations of the representation space used in such policies or value functions. Can neural models provide the missing link by easily representing more complex generalized policies, thus making them effective on all instances of a given domain? We present SymNet, the first neural approach for solving RMDPs that are expressed in the probabilistic planning language of RDDL. SymNet trains a set of shared parameters for an RDDL domain using training instances from that domain. For each instance, SymNet first converts it to an instance graph and then uses relational neural models to compute node embeddings. It then scores each ground action as a function over the first-order action symbols and node embeddings related to the action. Given a new test instance from the same domain, SymNet architecture with pre-trained parameters scores each ground action and chooses the best action. This can be accomplished in a single forward pass without any retraining on the test instance, thus implicitly representing a neural generalized policy for the whole domain. Our experiments on nine RDDL domains from IPPC demonstrate that SymNet policies are significantly better than random and sometimes even more effective than training a state-of-the-art deep reactive policy from scratch. △ Less

Submitted 29 June, 2020; v1 submitted 18 February, 2020; originally announced February 2020.

Comments: In Proceeding of ICML 2020. Code can be found at https://github.com/dair-iitd/symnet

arXiv:1910.01161 [pdf, ps, other]

Stochastic Bandits with Delayed Composite Anonymous Feedback

Authors: Siddhant Garg, Aditya Kumar Akash

Abstract: We explore a novel setting of the Multi-Armed Bandit (MAB) problem inspired from real world applications which we call bandits with "stochastic delayed composite anonymous feedback (SDCAF)". In SDCAF, the rewards on pulling arms are stochastic with respect to time but spread over a fixed number of time steps in the future after pulling the arm. The complexity of this problem stems from the anonymo… ▽ More We explore a novel setting of the Multi-Armed Bandit (MAB) problem inspired from real world applications which we call bandits with "stochastic delayed composite anonymous feedback (SDCAF)". In SDCAF, the rewards on pulling arms are stochastic with respect to time but spread over a fixed number of time steps in the future after pulling the arm. The complexity of this problem stems from the anonymous feedback to the player and the stochastic generation of the reward. Due to the aggregated nature of the rewards, the player is unable to associate the reward to a particular time step from the past. We present two algorithms for this more complicated setting of SDCAF using phase based extensions of the UCB algorithm. We perform regret analysis to show sub-linear theoretical guarantees on both the algorithms. △ Less

Submitted 11 October, 2019; v1 submitted 2 October, 2019; originally announced October 2019.

Comments: 33rd Conference on Neural Information Processing Systems (NeurIPS) Workshop on Machine Learning with Guarantees

arXiv:1909.03881 [pdf, other]

Nearly-Unsupervised Hashcode Representations for Relation Extraction

Authors: Sahil Garg, Aram Galstyan, Greg Ver Steeg, Guillermo Cecchi

Abstract: Recently, kernelized locality sensitive hashcodes have been successfully employed as representations of natural language text, especially showing high relevance to biomedical relation extraction tasks. In this paper, we propose to optimize the hashcode representations in a nearly unsupervised manner, in which we only use data points, but not their class labels, for learning. The optimized hashcode… ▽ More Recently, kernelized locality sensitive hashcodes have been successfully employed as representations of natural language text, especially showing high relevance to biomedical relation extraction tasks. In this paper, we propose to optimize the hashcode representations in a nearly unsupervised manner, in which we only use data points, but not their class labels, for learning. The optimized hashcode representations are then fed to a supervised classifier following the prior work. This nearly unsupervised approach allows fine-grained optimization of each hash function, which is particularly suitable for building hashcode representations generalizing from a training set to a test set. We empirically evaluate the proposed approach for biomedical relation extraction tasks, obtaining significant accuracy improvements w.r.t. state-of-the-art supervised and semi-supervised approaches. △ Less

Submitted 9 September, 2019; originally announced September 2019.

Comments: Proceedings of EMNLP-19

arXiv:1907.01643 [pdf, other]

Pentagon at MEDIQA 2019: Multi-task Learning for Filtering and Re-ranking Answers using Language Inference and Question Entailment

Authors: Hemant Pugaliya, Karan Saxena, Shefali Garg, Sheetal Shalini, Prashant Gupta, Eric Nyberg, Teruko Mitamura

Abstract: Parallel deep learning architectures like fine-tuned BERT and MT-DNN, have quickly become the state of the art, bypassing previous deep and shallow learning methods by a large margin. More recently, pre-trained models from large related datasets have been able to perform well on many downstream tasks by just fine-tuning on domain-specific datasets . However, using powerful models on non-trivial ta… ▽ More Parallel deep learning architectures like fine-tuned BERT and MT-DNN, have quickly become the state of the art, bypassing previous deep and shallow learning methods by a large margin. More recently, pre-trained models from large related datasets have been able to perform well on many downstream tasks by just fine-tuning on domain-specific datasets . However, using powerful models on non-trivial tasks, such as ranking and large document classification, still remains a challenge due to input size limitations of parallel architecture and extremely small datasets (insufficient for fine-tuning). In this work, we introduce an end-to-end system, trained in a multi-task setting, to filter and re-rank answers in the medical domain. We use task-specific pre-trained models as deep feature extractors. Our model achieves the highest Spearman's Rho and Mean Reciprocal Rank of 0.338 and 0.9622 respectively, on the ACL-BioNLP workshop MediQA Question Answering shared-task. △ Less

Submitted 1 July, 2019; originally announced July 2019.

arXiv:1906.10773 [pdf, other]

doi 10.1145/3408288

Are Adversarial Perturbations a Showstopper for ML-Based CAD? A Case Study on CNN-Based Lithographic Hotspot Detection

Authors: Kang Liu, Haoyu Yang, Yuzhe Ma, Benjamin Tan, Bei Yu, Evangeline F. Y. Young, Ramesh Karri, Siddharth Garg

Abstract: There is substantial interest in the use of machine learning (ML) based techniques throughout the electronic computer-aided design (CAD) flow, particularly those based on deep learning. However, while deep learning methods have surpassed state-of-the-art performance in several applications, they have exhibited intrinsic susceptibility to adversarial perturbations --- small but deliberate alteratio… ▽ More There is substantial interest in the use of machine learning (ML) based techniques throughout the electronic computer-aided design (CAD) flow, particularly those based on deep learning. However, while deep learning methods have surpassed state-of-the-art performance in several applications, they have exhibited intrinsic susceptibility to adversarial perturbations --- small but deliberate alterations to the input of a neural network, precipitating incorrect predictions. In this paper, we seek to investigate whether adversarial perturbations pose risks to ML-based CAD tools, and if so, how these risks can be mitigated. To this end, we use a motivating case study of lithographic hotspot detection, for which convolutional neural networks (CNN) have shown great promise. In this context, we show the first adversarial perturbation attacks on state-of-the-art CNN-based hotspot detectors; specifically, we show that small (on average 0.5% modified area), functionality preserving and design-constraint satisfying changes to a layout can nonetheless trick a CNN-based hotspot detector into predicting the modified layout as hotspot free (with up to 99.7% success). We propose an adversarial retraining strategy to improve the robustness of CNN-based hotspot detection and show that this strategy significantly improves robustness (by a factor of ~3) against adversarial attacks without compromising classification accuracy. △ Less

Submitted 25 June, 2019; originally announced June 2019.

Journal ref: ACM Trans. Des. Autom. Electron. Syst. 25, 5, Article 48 (August 2020)

arXiv:1905.11564 [pdf, ps, other]

Adversarially Robust Learning Could Leverage Computational Hardness

Authors: Sanjam Garg, Somesh Jha, Saeed Mahloujifar, Mohammad Mahmoody

Abstract: Over recent years, devising classification algorithms that are robust to adversarial perturbations has emerged as a challenging problem. In particular, deep neural nets (DNNs) seem to be susceptible to small imperceptible changes over test instances. However, the line of work in provable robustness, so far, has been focused on information-theoretic robustness, ruling out even the existence of any… ▽ More Over recent years, devising classification algorithms that are robust to adversarial perturbations has emerged as a challenging problem. In particular, deep neural nets (DNNs) seem to be susceptible to small imperceptible changes over test instances. However, the line of work in provable robustness, so far, has been focused on information-theoretic robustness, ruling out even the existence of any adversarial examples. In this work, we study whether there is a hope to benefit from algorithmic nature of an attacker that searches for adversarial examples, and ask whether there is any learning task for which it is possible to design classifiers that are only robust against polynomial-time adversaries. Indeed, numerous cryptographic tasks can only be secure against computationally bounded adversaries, and are indeed impossible for computationally unbounded attackers. Thus, it is natural to ask if the same strategy could help robust learning. We show that computational limitation of attackers can indeed be useful in robust learning by demonstrating the possibility of a classifier for some learning task for which computational and information theoretic adversaries of bounded perturbations have very different power. Namely, while computationally unbounded adversaries can attack successfully and find adversarial examples with small perturbation, polynomial time adversaries are unable to do so unless they can break standard cryptographic hardness assumptions. Our results, therefore, indicate that perhaps a similar approach to cryptography (relying on computational hardness) holds promise for achieving computationally robust machine learning. On the reverse directions, we also show that the existence of such learning task in which computational robustness beats information theoretic robustness requires computational hardness by implying (average-case) hardness of NP. △ Less

Submitted 19 December, 2019; v1 submitted 27 May, 2019; originally announced May 2019.

arXiv:1905.07088 [pdf, other]

Sliced Score Matching: A Scalable Approach to Density and Score Estimation

Authors: Yang Song, Sahaj Garg, Jiaxin Shi, Stefano Ermon

Abstract: Score matching is a popular method for estimating unnormalized statistical models. However, it has been so far limited to simple, shallow models or low-dimensional data, due to the difficulty of computing the Hessian of log-density functions. We show this difficulty can be mitigated by projecting the scores onto random vectors before comparing them. This objective, called sliced score matching, on… ▽ More Score matching is a popular method for estimating unnormalized statistical models. However, it has been so far limited to simple, shallow models or low-dimensional data, due to the difficulty of computing the Hessian of log-density functions. We show this difficulty can be mitigated by projecting the scores onto random vectors before comparing them. This objective, called sliced score matching, only involves Hessian-vector products, which can be easily implemented using reverse-mode automatic differentiation. Therefore, sliced score matching is amenable to more complex models and higher dimensional data compared to score matching. Theoretically, we prove the consistency and asymptotic normality of sliced score matching estimators. Moreover, we demonstrate that sliced score matching can be used to learn deep score estimators for implicit distributions. In our experiments, we show sliced score matching can learn deep energy-based models effectively, and can produce accurate score estimates for applications such as variational inference with implicit distributions and training Wasserstein Auto-Encoders. △ Less

Submitted 27 June, 2019; v1 submitted 16 May, 2019; originally announced May 2019.

Comments: UAI 2019

arXiv:1904.12053 [pdf, other]

Sample Amplification: Increasing Dataset Size even when Learning is Impossible

Authors: Brian Axelrod, Shivam Garg, Vatsal Sharan, Gregory Valiant

Abstract: Given data drawn from an unknown distribution, $D$, to what extent is it possible to ``amplify'' this dataset and output an even larger set of samples that appear to have been drawn from $D$? We formalize this question as follows: an $(n,m)$ $\text{amplification procedure}$ takes as input $n$ independent draws from an unknown distribution $D$, and outputs a set of $m > n$ ``samples''. An amplifica… ▽ More Given data drawn from an unknown distribution, $D$, to what extent is it possible to ``amplify'' this dataset and output an even larger set of samples that appear to have been drawn from $D$? We formalize this question as follows: an $(n,m)$ $\text{amplification procedure}$ takes as input $n$ independent draws from an unknown distribution $D$, and outputs a set of $m > n$ ``samples''. An amplification procedure is valid if no algorithm can distinguish the set of $m$ samples produced by the amplifier from a set of $m$ independent draws from $D$, with probability greater than $2/3$. Perhaps surprisingly, in many settings, a valid amplification procedure exists, even when the size of the input dataset, $n$, is significantly less than what would be necessary to learn $D$ to non-trivial accuracy. Specifically we consider two fundamental settings: the case where $D$ is an arbitrary discrete distribution supported on $\le k$ elements, and the case where $D$ is a $d$-dimensional Gaussian with unknown mean, and fixed covariance. In the first case, we show that an $\left(n, n + Θ(\frac{n}{\sqrt{k}})\right)$ amplifier exists. In particular, given $n=O(\sqrt{k})$ samples from $D$, one can output a set of $m=n+1$ datapoints, whose total variation distance from the distribution of $m$ i.i.d. draws from $D$ is a small constant, despite the fact that one would need quadratically more data, $n=Θ(k)$, to learn $D$ up to small constant total variation distance. In the Gaussian case, we show that an $\left(n,n+Θ(\frac{n}{\sqrt{d}} )\right)$ amplifier exists, even though learning the distribution to small constant total variation distance requires $Θ(d)$ samples. In both the discrete and Gaussian settings, we show that these results are tight, to constant factors. Beyond these results, we formalize a number of curious directions for future research along this vein. △ Less

Submitted 2 December, 2019; v1 submitted 26 April, 2019; originally announced April 2019.

Comments: Added discussion about potential applications

arXiv:1904.09942 [pdf, other]

Tracking and Improving Information in the Service of Fairness

Authors: Sumegha Garg, Michael P. Kim, Omer Reingold

Abstract: As algorithmic prediction systems have become widespread, fears that these systems may inadvertently discriminate against members of underrepresented populations have grown. With the goal of understanding fundamental principles that underpin the growing number of approaches to mitigating algorithmic discrimination, we investigate the role of information in fair prediction. A common strategy for de… ▽ More As algorithmic prediction systems have become widespread, fears that these systems may inadvertently discriminate against members of underrepresented populations have grown. With the goal of understanding fundamental principles that underpin the growing number of approaches to mitigating algorithmic discrimination, we investigate the role of information in fair prediction. A common strategy for decision-making uses a predictor to assign individuals a risk score; then, individuals are selected or rejected on the basis of this score. In this work, we study a formal framework for measuring the information content of predictors. Central to this framework is the notion of a refinement, first studied by Degroot and Fienberg. Intuitively, a refinement of a predictor $z$ increases the overall informativeness of the predictions without losing the information already contained in $z$. We show that increasing information content through refinements improves the downstream selection rules across a wide range of fairness measures (e.g. true positive rates, false positive rates, selection rates). In turn, refinements provide a simple but effective tool for reducing disparity in treatment and impact without sacrificing the utility of the predictions. Our results suggest that in many applications, the perceived "cost of fairness" results from an information disparity across populations, and thus, may be avoided with improved information. △ Less

Submitted 1 August, 2019; v1 submitted 22 April, 2019; originally announced April 2019.

Comments: Appeared at EC 2019

arXiv:1904.09489 [pdf, other]

Compression and Localization in Reinforcement Learning for ATARI Games

Authors: Joel Ruben Antony Moniz, Barun Patra, Sarthak Garg

Abstract: Deep neural networks have become commonplace in the domain of reinforcement learning, but are often expensive in terms of the number of parameters needed. While compressing deep neural networks has of late assumed great importance to overcome this drawback, little work has been done to address this problem in the context of reinforcement learning agents. This work aims at making first steps toward… ▽ More Deep neural networks have become commonplace in the domain of reinforcement learning, but are often expensive in terms of the number of parameters needed. While compressing deep neural networks has of late assumed great importance to overcome this drawback, little work has been done to address this problem in the context of reinforcement learning agents. This work aims at making first steps towards model compression in an RL agent. In particular, we compress networks to drastically reduce the number of parameters in them (to sizes less than 3% of their original size), further facilitated by applying a global max pool after the final convolution layer, and propose using Actor-Mimic in the context of compression. Finally, we show that this global max-pool allows for weakly supervised object localization, improving the ability to identify the agent's points of focus. △ Less

Submitted 20 April, 2019; originally announced April 2019.

Comments: NeurIPS 2018 Deep Reinforcement Learning Workshop

arXiv:1902.03081 [pdf, other]

Size Independent Neural Transfer for RDDL Planning

Authors: Sankalp Garg, Aniket Bajpai, Mausam

Abstract: Neural planners for RDDL MDPs produce deep reactive policies in an offline fashion. These scale well with large domains, but are sample inefficient and time-consuming to train from scratch for each new problem. To mitigate this, recent work has studied neural transfer learning, so that a generic planner trained on other problems of the same domain can rapidly transfer to a new problem. However, th… ▽ More Neural planners for RDDL MDPs produce deep reactive policies in an offline fashion. These scale well with large domains, but are sample inefficient and time-consuming to train from scratch for each new problem. To mitigate this, recent work has studied neural transfer learning, so that a generic planner trained on other problems of the same domain can rapidly transfer to a new problem. However, this approach only transfers across problems of the same size. We present the first method for neural transfer of RDDL MDPs that can transfer across problems of different sizes. Our architecture has two key innovations to achieve size independence: (1) a state encoder, which outputs a fixed length state embedding by max pooling over varying number of object embeddings, (2) a single parameter-tied action decoder that projects object embeddings into action probabilities for the final policy. On the two challenging RDDL domains of SysAdmin and Game Of Life, our approach powerfully transfers across problem sizes and has superior learning curves over training from scratch. △ Less

Submitted 4 April, 2019; v1 submitted 8 February, 2019; originally announced February 2019.

Comments: Published in ICAPS 2019

arXiv:1811.06609 [pdf, other]

A Spectral View of Adversarially Robust Features

Authors: Shivam Garg, Vatsal Sharan, Brian Hu Zhang, Gregory Valiant

Abstract: Given the apparent difficulty of learning models that are robust to adversarial perturbations, we propose tackling the simpler problem of develo** adversarially robust features. Specifically, given a dataset and metric of interest, the goal is to return a function (or multiple functions) that 1) is robust to adversarial perturbations, and 2) has significant variation across the datapoints. We es… ▽ More Given the apparent difficulty of learning models that are robust to adversarial perturbations, we propose tackling the simpler problem of develo** adversarially robust features. Specifically, given a dataset and metric of interest, the goal is to return a function (or multiple functions) that 1) is robust to adversarial perturbations, and 2) has significant variation across the datapoints. We establish strong connections between adversarially robust features and a natural spectral property of the geometry of the dataset and metric of interest. This connection can be leveraged to provide both robust features, and a lower bound on the robustness of any function that has significant variance across the dataset. Finally, we provide empirical evidence that the adversarially robust features given by this spectral approach can be fruitfully leveraged to learn a robust (and accurate) model. △ Less

Submitted 15 November, 2018; originally announced November 2018.

Comments: To appear at NIPS 2018

arXiv:1809.10610 [pdf, other]

Counterfactual Fairness in Text Classification through Robustness

Authors: Sahaj Garg, Vincent Perot, Nicole Limtiaco, Ankur Taly, Ed H. Chi, Alex Beutel

Abstract: In this paper, we study counterfactual fairness in text classification, which asks the question: How would the prediction change if the sensitive attribute referenced in the example were different? Toxicity classifiers demonstrate a counterfactual fairness issue by predicting that "Some people are gay" is toxic while "Some people are straight" is nontoxic. We offer a metric, counterfactual token f… ▽ More In this paper, we study counterfactual fairness in text classification, which asks the question: How would the prediction change if the sensitive attribute referenced in the example were different? Toxicity classifiers demonstrate a counterfactual fairness issue by predicting that "Some people are gay" is toxic while "Some people are straight" is nontoxic. We offer a metric, counterfactual token fairness (CTF), for measuring this particular form of fairness in text classifiers, and describe its relationship with group fairness. Further, we offer three approaches, blindness, counterfactual augmentation, and counterfactual logit pairing (CLP), for optimizing counterfactual token fairness during training, bridging the robustness and fairness literature. Empirically, we find that blindness and CLP address counterfactual token fairness. The methods do not harm classifier performance, and have varying tradeoffs with group fairness. These approaches, both for measurement and optimization, provide a new path forward for addressing fairness concerns in text classification. △ Less

Submitted 13 February, 2019; v1 submitted 27 September, 2018; originally announced September 2018.

Showing 1–50 of 61 results for author: Garg, S