Search | arXiv e-print repository

Unsupervised Learning under Latent Label Shift

Authors: Manley Roberts, Pranav Mani, Saurabh Garg, Zachary C. Lipton

Abstract: What sorts of structure might enable a learner to discover classes from unlabeled data? Traditional approaches rely on feature-space similarity and heroic assumptions on the data. In this paper, we introduce unsupervised learning under Latent Label Shift (LLS), where we have access to unlabeled data from multiple domains such that the label marginals $p_d(y)$ can shift across domains but the class… ▽ More What sorts of structure might enable a learner to discover classes from unlabeled data? Traditional approaches rely on feature-space similarity and heroic assumptions on the data. In this paper, we introduce unsupervised learning under Latent Label Shift (LLS), where we have access to unlabeled data from multiple domains such that the label marginals $p_d(y)$ can shift across domains but the class conditionals $p(\mathbf{x}|y)$ do not. This work instantiates a new principle for identifying classes: elements that shift together group together. For finite input spaces, we establish an isomorphism between LLS and topic modeling: inputs correspond to words, domains to documents, and labels to topics. Addressing continuous data, we prove that when each label's support contains a separable region, analogous to an anchor word, oracle access to $p(d|\mathbf{x})$ suffices to identify $p_d(y)$ and $p_d(y|\mathbf{x})$ up to permutation. Thus motivated, we introduce a practical algorithm that leverages domain-discriminative models as follows: (i) push examples through domain discriminator $p(d|\mathbf{x})$; (ii) discretize the data by clustering examples in $p(d|\mathbf{x})$ space; (iii) perform non-negative matrix factorization on the discrete data; (iv) combine the recovered $p(y|d)$ with the discriminator outputs $p(d|\mathbf{x})$ to compute $p_d(y|x) \; \forall d$. With semi-synthetic experiments, we show that our algorithm can leverage domain information to improve upon competitive unsupervised classification methods. We reveal a failure mode of standard unsupervised classification methods when feature-space similarity does not indicate true grou**s, and show empirically that our method better handles this case. Our results establish a deep connection between distribution shift and topic modeling, opening promising lines for future work. △ Less

Submitted 1 December, 2022; v1 submitted 26 July, 2022; originally announced July 2022.

Comments: NeurIPS 2022. Manley Roberts and Pranav Mani contributed equally to this work

arXiv:2207.13048 [pdf, other]

Domain Adaptation under Open Set Label Shift

Authors: Saurabh Garg, Sivaraman Balakrishnan, Zachary C. Lipton

Abstract: We introduce the problem of domain adaptation under Open Set Label Shift (OSLS) where the label distribution can change arbitrarily and a new class may arrive during deployment, but the class-conditional distributions p(x|y) are domain-invariant. OSLS subsumes domain adaptation under label shift and Positive-Unlabeled (PU) learning. The learner's goals here are two-fold: (a) estimate the target la… ▽ More We introduce the problem of domain adaptation under Open Set Label Shift (OSLS) where the label distribution can change arbitrarily and a new class may arrive during deployment, but the class-conditional distributions p(x|y) are domain-invariant. OSLS subsumes domain adaptation under label shift and Positive-Unlabeled (PU) learning. The learner's goals here are two-fold: (a) estimate the target label distribution, including the novel class; and (b) learn a target classifier. First, we establish necessary and sufficient conditions for identifying these quantities. Second, motivated by advances in label shift and PU learning, we propose practical methods for both tasks that leverage black-box predictors. Unlike typical Open Set Domain Adaptation (OSDA) problems, which tend to be ill-posed and amenable only to heuristics, OSLS offers a well-posed problem amenable to more principled machinery. Experiments across numerous semi-synthetic benchmarks on vision, language, and medical datasets demonstrate that our methods consistently outperform OSDA baselines, achieving 10--25% improvements in target domain accuracy. Finally, we analyze the proposed methods, establishing finite-sample convergence to the true label marginal and convergence to optimal classifier for linear models in a Gaussian setup. Code is available at https://github.com/acmi-lab/Open-Set-Label-Shift. △ Less

Submitted 16 October, 2022; v1 submitted 26 July, 2022; originally announced July 2022.

Comments: Accepted at NeurIPS 2022

arXiv:2206.13648 [pdf, other]

Supervised Learning with General Risk Functionals

Authors: Liu Leqi, Audrey Huang, Zachary C. Lipton, Kamyar Azizzadenesheli

Abstract: Standard uniform convergence results bound the generalization gap of the expected loss over a hypothesis class. The emergence of risk-sensitive learning requires generalization guarantees for functionals of the loss distribution beyond the expectation. While prior works specialize in uniform convergence of particular functionals, our work provides uniform convergence for a general class of Hölder… ▽ More Standard uniform convergence results bound the generalization gap of the expected loss over a hypothesis class. The emergence of risk-sensitive learning requires generalization guarantees for functionals of the loss distribution beyond the expectation. While prior works specialize in uniform convergence of particular functionals, our work provides uniform convergence for a general class of Hölder risk functionals for which the closeness in the Cumulative Distribution Function (CDF) entails closeness in risk. We establish the first uniform convergence results for estimating the CDF of the loss distribution, yielding guarantees that hold simultaneously both over all Hölder risk functionals and over all hypotheses. Thus licensed to perform empirical risk minimization, we develop practical gradient-based methods for minimizing distortion risks (widely studied subset of Hölder risks that subsumes the spectral risks, including the mean, conditional value at risk, cumulative prospect theory risks, and others) and provide convergence guarantees. In experiments, we demonstrate the efficacy of our learning procedure, both in settings where uniform convergence results hold and in high-dimensional settings with deep networks. △ Less

Submitted 27 June, 2022; originally announced June 2022.

arXiv:2206.10654 [pdf, other]

On the Maximum Hessian Eigenvalue and Generalization

Authors: Simran Kaur, Jeremy Cohen, Zachary C. Lipton

Abstract: The mechanisms by which certain training interventions, such as increasing learning rates and applying batch normalization, improve the generalization of deep networks remains a mystery. Prior works have speculated that "flatter" solutions generalize better than "sharper" solutions to unseen data, motivating several metrics for measuring flatness (particularly $λ_{max}$, the largest eigenvalue of… ▽ More The mechanisms by which certain training interventions, such as increasing learning rates and applying batch normalization, improve the generalization of deep networks remains a mystery. Prior works have speculated that "flatter" solutions generalize better than "sharper" solutions to unseen data, motivating several metrics for measuring flatness (particularly $λ_{max}$, the largest eigenvalue of the Hessian of the loss); and algorithms, such as Sharpness-Aware Minimization (SAM) [1], that directly optimize for flatness. Other works question the link between $λ_{max}$ and generalization. In this paper, we present findings that call $λ_{max}$'s influence on generalization further into question. We show that: (1) while larger learning rates reduce $λ_{max}$ for all batch sizes, generalization benefits sometimes vanish at larger batch sizes; (2) by scaling batch size and learning rate simultaneously, we can change $λ_{max}$ without affecting generalization; (3) while SAM produces smaller $λ_{max}$ for all batch sizes, generalization benefits (also) vanish with larger batch sizes; (4) for dropout, excessively high dropout probabilities can degrade generalization, even as they promote smaller $λ_{max}$; and (5) while batch-normalization does not consistently produce smaller $λ_{max}$, it nevertheless confers generalization benefits. While our experiments affirm the generalization benefits of large learning rates and SAM for minibatch SGD, the GD-SGD discrepancy demonstrates limits to $λ_{max}$'s ability to explain generalization in neural networks. △ Less

Submitted 23 May, 2023; v1 submitted 21 June, 2022; originally announced June 2022.

Comments: Proceedings on "I Can't Believe It's Not Better! - Understanding Deep Learning Through Empirical Falsification" at NeurIPS 2022 Workshops, PMLR 187:51-65, 2023

arXiv:2206.04039 [pdf, ps, other]

Resolving the Human Subjects Status of Machine Learning's Crowdworkers

Authors: Divyansh Kaushik, Zachary C. Lipton, Alex John London

Abstract: In recent years, machine learning (ML) has relied heavily on crowdworkers both for building datasets and for addressing research questions requiring human interaction or judgment. The diverse tasks performed and uses of the data produced render it difficult to determine when crowdworkers are best thought of as workers (versus human subjects). These difficulties are compounded by conflicting polici… ▽ More In recent years, machine learning (ML) has relied heavily on crowdworkers both for building datasets and for addressing research questions requiring human interaction or judgment. The diverse tasks performed and uses of the data produced render it difficult to determine when crowdworkers are best thought of as workers (versus human subjects). These difficulties are compounded by conflicting policies, with some institutions and researchers regarding all ML crowdworkers as human subjects and others holding that they rarely constitute human subjects. Notably few ML papers involving crowdwork mention IRB oversight, raising the prospect of non-compliance with ethical and regulatory requirements. We investigate the appropriate designation of ML crowdsourcing studies, focusing our inquiry on natural language processing to expose unique challenges for research oversight. Crucially, under the U.S. Common Rule, these judgments hinge on determinations of aboutness, concerning both whom (or what) the collected data is about and whom (or what) the analysis is about. We highlight two challenges posed by ML: the same set of workers can serve multiple roles and provide many sorts of information; and ML research tends to embrace a dynamic workflow, where research questions are seldom stated ex ante and data sharing opens the door for future studies to aim questions at different targets. Our analysis exposes a potential loophole in the Common Rule, where researchers can elude research ethics oversight by splitting data collection and analysis into distinct studies. Finally, we offer several policy recommendations to address these concerns. △ Less

Submitted 15 June, 2023; v1 submitted 8 June, 2022; originally announced June 2022.

arXiv:2205.09701 [pdf, other]

Homophily and Incentive Effects in Use of Algorithms

Authors: Riccardo Fogliato, Sina Fazelpour, Shantanu Gupta, Zachary Lipton, David Danks

Abstract: As algorithmic tools increasingly aid experts in making consequential decisions, the need to understand the precise factors that mediate their influence has grown commensurately. In this paper, we present a crowdsourcing vignette study designed to assess the impacts of two plausible factors on AI-informed decision-making. First, we examine homophily -- do people defer more to models that tend to a… ▽ More As algorithmic tools increasingly aid experts in making consequential decisions, the need to understand the precise factors that mediate their influence has grown commensurately. In this paper, we present a crowdsourcing vignette study designed to assess the impacts of two plausible factors on AI-informed decision-making. First, we examine homophily -- do people defer more to models that tend to agree with them? -- by manipulating the agreement during training between participants and the algorithmic tool. Second, we considered incentives -- how do people incorporate a (known) cost structure in the hybrid decision-making setting? -- by varying rewards associated with true positives vs. true negatives. Surprisingly, we found limited influence of either homophily and no evidence of incentive effects, despite participants performing similarly to previous studies. Higher levels of agreement between the participant and the AI tool yielded more confident predictions, but only when outcome feedback was absent. These results highlight the complexity of characterizing human-algorithm interactions, and suggest that findings from social psychology may require re-examination when humans interact with algorithms. △ Less

Submitted 19 May, 2022; originally announced May 2022.

Comments: Accepted at CogSci, 2022

arXiv:2203.13423 [pdf, ps, other]

Modeling Attrition in Recommender Systems with Departing Bandits

Authors: Omer Ben-Porat, Lee Cohen, Liu Leqi, Zachary C. Lipton, Yishay Mansour

Abstract: Traditionally, when recommender systems are formalized as multi-armed bandits, the policy of the recommender system influences the rewards accrued, but not the length of interaction. However, in real-world systems, dissatisfied users may depart (and never come back). In this work, we propose a novel multi-armed bandit setup that captures such policy-dependent horizons. Our setup consists of a fini… ▽ More Traditionally, when recommender systems are formalized as multi-armed bandits, the policy of the recommender system influences the rewards accrued, but not the length of interaction. However, in real-world systems, dissatisfied users may depart (and never come back). In this work, we propose a novel multi-armed bandit setup that captures such policy-dependent horizons. Our setup consists of a finite set of user types, and multiple arms with Bernoulli payoffs. Each (user type, arm) tuple corresponds to an (unknown) reward probability. Each user's type is initially unknown and can only be inferred through their response to recommendations. Moreover, if a user is dissatisfied with their recommendation, they might depart the system. We first address the case where all users share the same type, demonstrating that a recent UCB-based algorithm is optimal. We then move forward to the more challenging case, where users are divided among two types. While naive approaches cannot handle this setting, we provide an efficient learning algorithm that achieves $\tilde{O}(\sqrt{T})$ regret, where $T$ is the number of users. △ Less

Submitted 15 February, 2024; v1 submitted 24 March, 2022; originally announced March 2022.

Comments: Accepted at AAAI 2022

arXiv:2202.01336 [pdf, other]

Exploring Transformer Backbones for Heterogeneous Treatment Effect Estimation

Authors: Yi-Fan Zhang, Hanlin Zhang, Zachary C. Lipton, Li Erran Li, Eric P. Xing

Abstract: Previous works on Treatment Effect Estimation (TEE) are not in widespread use because they are predominantly theoretical, where strong parametric assumptions are made but untractable for practical application. Recent work uses multilayer perceptron (MLP) for modeling casual relationships, however, MLPs lag far behind recent advances in ML methodology, which limits their applicability and generaliz… ▽ More Previous works on Treatment Effect Estimation (TEE) are not in widespread use because they are predominantly theoretical, where strong parametric assumptions are made but untractable for practical application. Recent work uses multilayer perceptron (MLP) for modeling casual relationships, however, MLPs lag far behind recent advances in ML methodology, which limits their applicability and generalizability. To extend beyond the single domain formulation and towards more realistic learning scenarios, we explore model design spaces beyond MLPs, i.e., transformer backbones, which provide flexibility where attention layers govern interactions among treatments and covariates to exploit structural similarities of potential outcomes for confounding control. Through careful model design, Transformers as Treatment Effect Estimators (TransTEE) is proposed. We show empirically that TransTEE can: (1) serve as a general purpose treatment effect estimator that significantly outperforms competitive baselines in a variety of challenging TEE problems (e.g., discrete, continuous, structured, or dosage-associated treatments) and is applicable to both when covariates are tabular and when they consist of structural data (e.g., texts, graphs); (2) yield multiple advantages: compatibility with propensity score modeling, parameter efficiency, robustness to continuous treatment value distribution shifts, explainable in covariate adjustment, and real-world utility in auditing pre-trained language models △ Less

Submitted 17 October, 2022; v1 submitted 2 February, 2022; originally announced February 2022.

arXiv:2201.04234 [pdf, other]

Leveraging Unlabeled Data to Predict Out-of-Distribution Performance

Authors: Saurabh Garg, Sivaraman Balakrishnan, Zachary C. Lipton, Behnam Neyshabur, Hanie Sedghi

Abstract: Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions that may cause performance drops. In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data. We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on… ▽ More Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions that may cause performance drops. In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data. We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples for which model confidence exceeds that threshold. ATC outperforms previous methods across several model architectures, types of distribution shifts (e.g., due to synthetic corruptions, dataset reproduction, or novel subpopulations), and datasets (Wilds, ImageNet, Breeds, CIFAR, and MNIST). In our experiments, ATC estimates target performance $2$-$4\times$ more accurately than prior methods. We also explore the theoretical foundations of the problem, proving that, in general, identifying the accuracy is just as hard as identifying the optimal predictor and thus, the efficacy of any method rests upon (perhaps unstated) assumptions on the nature of the shift. Finally, analyzing our method on some toy distributions, we provide insights concerning when it works. Code is available at https://github.com/saurabhgarg1996/ATC_code/. △ Less

Submitted 14 October, 2022; v1 submitted 11 January, 2022; originally announced January 2022.

Comments: Accepted at ICLR 2022

arXiv:2112.09669 [pdf, other]

Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations

Authors: Siddhant Arora, Danish Pruthi, Norman Sadeh, William W. Cohen, Zachary C. Lipton, Graham Neubig

Abstract: In attempts to "explain" predictions of machine learning models, researchers have proposed hundreds of techniques for attributing predictions to features that are deemed important. While these attributions are often claimed to hold the potential to improve human "understanding" of the models, surprisingly little work explicitly evaluates progress towards this aspiration. In this paper, we conduct… ▽ More In attempts to "explain" predictions of machine learning models, researchers have proposed hundreds of techniques for attributing predictions to features that are deemed important. While these attributions are often claimed to hold the potential to improve human "understanding" of the models, surprisingly little work explicitly evaluates progress towards this aspiration. In this paper, we conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews. They are challenged both to simulate the model on fresh reviews, and to edit reviews with the goal of lowering the probability of the originally predicted class. Successful manipulations would lead to an adversarial example. During the training (but not the test) phase, input spans are highlighted to communicate salience. Through our evaluation, we observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control. For the BERT-based classifier, popular local explanations do not improve their ability to reduce the model confidence over the no-explanation case. Remarkably, when the explanation for the BERT model is given by the (global) attributions of a linear model trained to imitate the BERT model, people can effectively manipulate the model. △ Less

Submitted 21 August, 2022; v1 submitted 17 December, 2021; originally announced December 2021.

Comments: AAAI 2022

arXiv:2111.00980 [pdf, other]

Mixture Proportion Estimation and PU Learning: A Modern Approach

Authors: Saurabh Garg, Yifan Wu, Alex Smola, Sivaraman Balakrishnan, Zachary C. Lipton

Abstract: Given only positive examples and unlabeled examples (from both positive and negative classes), we might hope nevertheless to estimate an accurate positive-versus-negative classifier. Formally, this task is broken down into two subtasks: (i) Mixture Proportion Estimation (MPE) -- determining the fraction of positive examples in the unlabeled data; and (ii) PU-learning -- given such an estimate, lea… ▽ More Given only positive examples and unlabeled examples (from both positive and negative classes), we might hope nevertheless to estimate an accurate positive-versus-negative classifier. Formally, this task is broken down into two subtasks: (i) Mixture Proportion Estimation (MPE) -- determining the fraction of positive examples in the unlabeled data; and (ii) PU-learning -- given such an estimate, learning the desired positive-versus-negative classifier. Unfortunately, classical methods for both problems break down in high-dimensional settings. Meanwhile, recently proposed heuristics lack theoretical coherence and depend precariously on hyperparameter tuning. In this paper, we propose two simple techniques: Best Bin Estimation (BBE) (for MPE); and Conditional Value Ignoring Risk (CVIR), a simple objective for PU-learning. Both methods dominate previous approaches empirically, and for BBE, we establish formal guarantees that hold whenever we can train a model to cleanly separate out a small subset of positive examples. Our final algorithm (TED)$^n$, alternates between the two procedures, significantly improving both our mixture proportion estimator and classifier △ Less

Submitted 1 November, 2021; originally announced November 2021.

Comments: Spotlight at NeurIPS 2021

arXiv:2110.07566 [pdf, other]

Practical Benefits of Feature Feedback Under Distribution Shift

Authors: Anurag Katakkar, Clay H. Yoo, Weiqin Wang, Zachary C. Lipton, Divyansh Kaushik

Abstract: In attempts to develop sample-efficient and interpretable algorithms, researcher have explored myriad mechanisms for collecting and exploiting feature feedback (or rationales) auxiliary annotations provided for training (but not test) instances that highlight salient evidence. Examples include bounding boxes around objects and salient spans in text. Despite its intuitive appeal, feature feedback h… ▽ More In attempts to develop sample-efficient and interpretable algorithms, researcher have explored myriad mechanisms for collecting and exploiting feature feedback (or rationales) auxiliary annotations provided for training (but not test) instances that highlight salient evidence. Examples include bounding boxes around objects and salient spans in text. Despite its intuitive appeal, feature feedback has not delivered significant gains in practical problems as assessed on iid holdout sets. However, recent works on counterfactually augmented data suggest an alternative benefit of supplemental annotations, beyond interpretability: lessening sensitivity to spurious patterns and consequently delivering gains in out-of-domain evaluations. We speculate that while existing methods for incorporating feature feedback have delivered negligible in-sample performance gains, they may nevertheless provide out-of-domain benefits. Our experiments addressing sentiment analysis, show that feature feedback methods perform significantly better on various natural out-of-domain datasets despite comparable in-domain evaluations. By contrast, performance on natural language inference remains comparable. Finally, we compare those tasks where feature feedback does (and does not) help. △ Less

Submitted 17 October, 2022; v1 submitted 14 October, 2021; originally announced October 2021.

arXiv:2109.04953 [pdf, other]

Does Pretraining for Summarization Require Knowledge Transfer?

Authors: Kundan Krishna, Jeffrey Bigham, Zachary C. Lipton

Abstract: Pretraining techniques leveraging enormous datasets have driven recent advances in text summarization. While folk explanations suggest that knowledge transfer accounts for pretraining's benefits, little is known about why it works or what makes a pretraining task or dataset suitable. In this paper, we challenge the knowledge transfer story, showing that pretraining on documents consisting of chara… ▽ More Pretraining techniques leveraging enormous datasets have driven recent advances in text summarization. While folk explanations suggest that knowledge transfer accounts for pretraining's benefits, little is known about why it works or what makes a pretraining task or dataset suitable. In this paper, we challenge the knowledge transfer story, showing that pretraining on documents consisting of character n-grams selected at random, we can nearly match the performance of models pretrained on real corpora. This work holds the promise of eliminating upstream corpora, which may alleviate some concerns over offensive language, bias, and copyright issues. To see whether the small residual benefit of using real data could be accounted for by the structure of the pretraining task, we design several tasks motivated by a qualitative study of summarization corpora. However, these tasks confer no appreciable benefit, leaving open the possibility of a small role for knowledge transfer. △ Less

Submitted 10 September, 2021; originally announced September 2021.

Comments: Camera-ready for Findings of EMNLP 2021

arXiv:2109.01443 [pdf, other]

The Impact of Algorithmic Risk Assessments on Human Predictions and its Analysis via Crowdsourcing Studies

Authors: Riccardo Fogliato, Alexandra Chouldechova, Zachary Lipton

Abstract: As algorithmic risk assessment instruments (RAIs) are increasingly adopted to assist decision makers, their predictive performance and potential to promote inequity have come under scrutiny. However, while most studies examine these tools in isolation, researchers have come to recognize that assessing their impact requires understanding the behavior of their human interactants. In this paper, buil… ▽ More As algorithmic risk assessment instruments (RAIs) are increasingly adopted to assist decision makers, their predictive performance and potential to promote inequity have come under scrutiny. However, while most studies examine these tools in isolation, researchers have come to recognize that assessing their impact requires understanding the behavior of their human interactants. In this paper, building off of several recent crowdsourcing works focused on criminal justice, we conduct a vignette study in which laypersons are tasked with predicting future re-arrests. Our key findings are as follows: (1) Participants often predict that an offender will be rearrested even when they deem the likelihood of re-arrest to be well below 50%; (2) Participants do not anchor on the RAI's predictions; (3) The time spent on the survey varies widely across participants and most cases are assessed in less than 10 seconds; (4) Judicial decisions, unlike participants' predictions, depend in part on factors that are orthogonal to the likelihood of re-arrest. These results highlight the influence of several crucial but often overlooked design decisions and concerns around generalizability when constructing crowdsourcing studies to analyze the impacts of RAIs. △ Less

Submitted 3 September, 2021; originally announced September 2021.

Comments: Proceedings of the ACM on Human-Computer Interaction 5, CSCW2, Article 428 (October 2021)

arXiv:2108.09265 [pdf, other]

Efficient Online Estimation of Causal Effects by Deciding What to Observe

Authors: Shantanu Gupta, Zachary C. Lipton, David Childers

Abstract: Researchers often face data fusion problems, where multiple data sources are available, each capturing a distinct subset of variables. While problem formulations typically take the data as given, in practice, data acquisition can be an ongoing process. In this paper, we aim to estimate any functional of a probabilistic model (e.g., a causal effect) as efficiently as possible, by deciding, at each… ▽ More Researchers often face data fusion problems, where multiple data sources are available, each capturing a distinct subset of variables. While problem formulations typically take the data as given, in practice, data acquisition can be an ongoing process. In this paper, we aim to estimate any functional of a probabilistic model (e.g., a causal effect) as efficiently as possible, by deciding, at each time, which data source to query. We propose online moment selection (OMS), a framework in which structural assumptions are encoded as moment conditions. The optimal action at each step depends, in part, on the very moments that identify the functional of interest. Our algorithms balance exploration with choosing the best action as suggested by current estimates of the moments. We propose two selection strategies: (1) explore-then-commit (OMS-ETC) and (2) explore-then-greedy (OMS-ETG), proving that both achieve zero asymptotic regret as assessed by MSE. We instantiate our setup for average treatment effect estimation, where structural assumptions are given by a causal graph and data sources may include subsets of mediators, confounders, and instrumental variables. △ Less

Submitted 30 October, 2021; v1 submitted 20 August, 2021; originally announced August 2021.

Comments: Accepted at NeurIPS 2021

arXiv:2107.00441 [pdf, ps, other]

When Curation Becomes Creation: Algorithms, Microcontent, and the Vanishing Distinction between Platforms and Creators

Authors: Liu Leqi, Dylan Hadfield-Menell, Zachary C. Lipton

Abstract: Ever since social activity on the Internet began migrating from the wilds of the open web to the walled gardens erected by so-called platforms, debates have raged about the responsibilities that these platforms ought to bear. And yet, despite intense scrutiny from the news media and grassroots movements of outraged users, platforms continue to operate, from a legal standpoint, on the friendliest t… ▽ More Ever since social activity on the Internet began migrating from the wilds of the open web to the walled gardens erected by so-called platforms, debates have raged about the responsibilities that these platforms ought to bear. And yet, despite intense scrutiny from the news media and grassroots movements of outraged users, platforms continue to operate, from a legal standpoint, on the friendliest terms. Under the current regulatory framework, platforms simultaneously benefit from: (1) broad discretion to organize (and censor) content however they choose; (2) powerful algorithms for curating a practically limitless supply of user-posted microcontent according to whatever ends they wish; and (3) absolution from the sorts of liability born by creators of the underlying content. In this paper, we contest the very validity of the platform-creator distinction, arguing that it is ill-adapted to the modern social media landscape where, in a real sense, platforms are creating derivative media products. We argue that any coherent regulatory framework must adapt to this reality, recognizing the subtle continuum of activities that span the curation-creation spectrum, providing a finer system of categorization and clearer guidance for precisely when platforms assume the responsibilities associated with content creation. △ Less

Submitted 1 July, 2021; originally announced July 2021.

arXiv:2106.11342 [pdf]

Dive into Deep Learning

Authors: Aston Zhang, Zachary C. Lipton, Mu Li, Alexander J. Smola

Abstract: This open-source book represents our attempt to make deep learning approachable, teaching readers the concepts, the context, and the code. The entire book is drafted in Jupyter notebooks, seamlessly integrating exposition figures, math, and interactive examples with self-contained code. Our goal is to offer a resource that could (i) be freely available for everyone; (ii) offer sufficient technical… ▽ More This open-source book represents our attempt to make deep learning approachable, teaching readers the concepts, the context, and the code. The entire book is drafted in Jupyter notebooks, seamlessly integrating exposition figures, math, and interactive examples with self-contained code. Our goal is to offer a resource that could (i) be freely available for everyone; (ii) offer sufficient technical depth to provide a starting point on the path to actually becoming an applied machine learning scientist; (iii) include runnable code, showing readers how to solve problems in practice; (iv) allow for rapid updates, both by us and also by the community at large; (v) be complemented by a forum for interactive discussion of technical details and to answer questions. △ Less

Submitted 22 August, 2023; v1 submitted 21 June, 2021; originally announced June 2021.

Comments: (HTML) https://D2L.ai (GitHub) https://github.com/d2l-ai/d2l-en/

arXiv:2106.07041 [pdf, other]

Correcting Exposure Bias for Link Recommendation

Authors: Shantanu Gupta, Hao Wang, Zachary C. Lipton, Yuyang Wang

Abstract: Link prediction methods are frequently applied in recommender systems, e.g., to suggest citations for academic papers or friends in social networks. However, exposure bias can arise when users are systematically underexposed to certain relevant items. For example, in citation networks, authors might be more likely to encounter papers from their own field and thus cite them preferentially. This bia… ▽ More Link prediction methods are frequently applied in recommender systems, e.g., to suggest citations for academic papers or friends in social networks. However, exposure bias can arise when users are systematically underexposed to certain relevant items. For example, in citation networks, authors might be more likely to encounter papers from their own field and thus cite them preferentially. This bias can propagate through naively trained link predictors, leading to both biased evaluation and high generalization error (as assessed by true relevance). Moreover, this bias can be exacerbated by feedback loops. We propose estimators that leverage known exposure probabilities to mitigate this bias and consequent feedback loops. Next, we provide a loss function for learning the exposure probabilities from data. Finally, experiments on semi-synthetic data based on real-world citation networks, show that our methods reliably identify (truly) relevant citations. Additionally, our methods lead to greater diversity in the recommended papers' fields of study. The code is available at https://github.com/shantanu95/exposure-bias-link-rec. △ Less

Submitted 13 June, 2021; originally announced June 2021.

arXiv:2106.00872 [pdf, other]

On the Efficacy of Adversarial Data Collection for Question Answering: Results from a Large-Scale Randomized Study

Authors: Divyansh Kaushik, Douwe Kiela, Zachary C. Lipton, Wen-tau Yih

Abstract: In adversarial data collection (ADC), a human workforce interacts with a model in real time, attempting to produce examples that elicit incorrect predictions. Researchers hope that models trained on these more challenging datasets will rely less on superficial patterns, and thus be less brittle. However, despite ADC's intuitive appeal, it remains unclear when training on adversarial datasets produ… ▽ More In adversarial data collection (ADC), a human workforce interacts with a model in real time, attempting to produce examples that elicit incorrect predictions. Researchers hope that models trained on these more challenging datasets will rely less on superficial patterns, and thus be less brittle. However, despite ADC's intuitive appeal, it remains unclear when training on adversarial datasets produces more robust models. In this paper, we conduct a large-scale controlled study focused on question answering, assigning workers at random to compose questions either (i) adversarially (with a model in the loop); or (ii) in the standard fashion (without a model). Across a variety of models and datasets, we find that models trained on adversarial data usually perform better on other adversarial datasets but worse on a diverse collection of out-of-domain evaluation sets. Finally, we provide a qualitative analysis of adversarial (vs standard) data, identifying key differences and offering guidance for future research. △ Less

Submitted 1 June, 2021; originally announced June 2021.

Comments: Accepted at ACL-IJCNLP 2021

arXiv:2105.04953 [pdf, other]

On the Validity of Arrest as a Proxy for Offense: Race and the Likelihood of Arrest for Violent Crimes

Authors: Riccardo Fogliato, Alice Xiang, Zachary Lipton, Daniel Nagin, Alexandra Chouldechova

Abstract: The risk of re-offense is considered in decision-making at many stages of the criminal justice system, from pre-trial, to sentencing, to parole. To aid decision makers in their assessments, institutions increasingly rely on algorithmic risk assessment instruments (RAIs). These tools assess the likelihood that an individual will be arrested for a new criminal offense within some time window followi… ▽ More The risk of re-offense is considered in decision-making at many stages of the criminal justice system, from pre-trial, to sentencing, to parole. To aid decision makers in their assessments, institutions increasingly rely on algorithmic risk assessment instruments (RAIs). These tools assess the likelihood that an individual will be arrested for a new criminal offense within some time window following their release. However, since not all crimes result in arrest, RAIs do not directly assess the risk of re-offense. Furthermore, disparities in the likelihood of arrest can potentially lead to biases in the resulting risk scores. Several recent validations of RAIs have therefore focused on arrests for violent offenses, which are viewed as being more accurate reflections of offending behavior. In this paper, we investigate biases in violent arrest data by analysing racial disparities in the likelihood of arrest for White and Black violent offenders. We focus our study on 2007--2016 incident-level data of violent offenses from 16 US states as recorded in the National Incident Based Reporting System (NIBRS). Our analysis shows that the magnitude and direction of the racial disparities depend on various characteristics of the crimes. In addition, our investigation reveals large variations in arrest rates across geographical locations and offense types. We discuss the implications of the observed disconnect between re-arrest and re-offense in the context of RAIs and the challenges around the use of data from NIBRS to correct for the sampling bias. △ Less

Submitted 11 May, 2021; originally announced May 2021.

Comments: Accepted at AAAI/ACM Conference on AI, Ethics, and Society (AIES), 2021

arXiv:2105.00303 [pdf, other]

RATT: Leveraging Unlabeled Data to Guarantee Generalization

Authors: Saurabh Garg, Sivaraman Balakrishnan, J. Zico Kolter, Zachary C. Lipton

Abstract: To assess generalization, machine learning scientists typically either (i) bound the generalization gap and then (after training) plug in the empirical risk to obtain a bound on the true risk; or (ii) validate empirically on holdout data. However, (i) typically yields vacuous guarantees for overparameterized models. Furthermore, (ii) shrinks the training set and its guarantee erodes with each re-u… ▽ More To assess generalization, machine learning scientists typically either (i) bound the generalization gap and then (after training) plug in the empirical risk to obtain a bound on the true risk; or (ii) validate empirically on holdout data. However, (i) typically yields vacuous guarantees for overparameterized models. Furthermore, (ii) shrinks the training set and its guarantee erodes with each re-use of the holdout set. In this paper, we introduce a method that leverages unlabeled data to produce generalization bounds. After augmenting our (labeled) training set with randomly labeled fresh examples, we train in the standard fashion. Whenever classifiers achieve low error on clean data and high error on noisy data, our bound provides a tight upper bound on the true risk. We prove that our bound is valid for 0-1 empirical risk minimization and with linear classifiers trained by gradient descent. Our approach is especially useful in conjunction with deep learning due to the early learning phenomenon whereby networks fit true labels before noisy labels but requires one intuitive assumption. Empirically, on canonical computer vision and NLP tasks, our bound provides non-vacuous generalization guarantees that track actual performance closely. This work provides practitioners with an option for certifying the generalization of deep nets even when unseen labeled data is unavailable and provides theoretical insights into the relationship between random label noise and generalization. △ Less

Submitted 6 November, 2021; v1 submitted 1 May, 2021; originally announced May 2021.

Comments: ICML 2021 (Long Talk)

arXiv:2104.08977 [pdf, other]

Off-Policy Risk Assessment in Contextual Bandits

Authors: Audrey Huang, Liu Leqi, Zachary C. Lipton, Kamyar Azizzadenesheli

Abstract: Even when unable to run experiments, practitioners can evaluate prospective policies, using previously logged data. However, while the bandits literature has adopted a diverse set of objectives, most research on off-policy evaluation to date focuses on the expected reward. In this paper, we introduce Lipschitz risk functionals, a broad class of objectives that subsumes conditional value-at-risk (C… ▽ More Even when unable to run experiments, practitioners can evaluate prospective policies, using previously logged data. However, while the bandits literature has adopted a diverse set of objectives, most research on off-policy evaluation to date focuses on the expected reward. In this paper, we introduce Lipschitz risk functionals, a broad class of objectives that subsumes conditional value-at-risk (CVaR), variance, mean-variance, many distorted risks, and CPT risks, among others. We propose Off-Policy Risk Assessment (OPRA), a framework that first estimates a target policy's CDF and then generates plugin estimates for any collection of Lipschitz risks, providing finite sample guarantees that hold simultaneously over the entire class. We instantiate OPRA with both importance sampling and doubly robust estimators. Our primary theoretical contributions are (i) the first uniform concentration inequalities for both CDF estimators in contextual bandits and (ii) error bounds on our Lipschitz risk estimates, which all converge at a rate of $O(1/\sqrt{n})$. △ Less

Submitted 29 June, 2021; v1 submitted 18 April, 2021; originally announced April 2021.

arXiv:2103.02827 [pdf, other]

On the Convergence and Optimality of Policy Gradient for Markov Coherent Risk

Authors: Audrey Huang, Liu Leqi, Zachary C. Lipton, Kamyar Azizzadenesheli

Abstract: In order to model risk aversion in reinforcement learning, an emerging line of research adapts familiar algorithms to optimize coherent risk functionals, a class that includes conditional value-at-risk (CVaR). Because optimizing the coherent risk is difficult in Markov decision processes, recent work tends to focus on the Markov coherent risk (MCR), a time-consistent surrogate. While, policy gradi… ▽ More In order to model risk aversion in reinforcement learning, an emerging line of research adapts familiar algorithms to optimize coherent risk functionals, a class that includes conditional value-at-risk (CVaR). Because optimizing the coherent risk is difficult in Markov decision processes, recent work tends to focus on the Markov coherent risk (MCR), a time-consistent surrogate. While, policy gradient (PG) updates have been derived for this objective, it remains unclear (i) whether PG finds a global optimum for MCR; (ii) how to estimate the gradient in a tractable manner. In this paper, we demonstrate that, in general, MCR objectives (unlike the expected return) are not gradient dominated and that stationary points are not, in general, guaranteed to be globally optimal. Moreover, we present a tight upper bound on the suboptimality of the learned policy, characterizing its dependence on the nonlinearity of the objective and the degree of risk aversion. Addressing (ii), we propose a practical implementation of PG that uses state distribution reweighting to overcome previous limitations. Through experiments, we demonstrate that when the optimality gap is small, PG can learn risk-sensitive policies. However, we find that instances with large suboptimality gaps are abundant and easy to construct, outlining an important challenge for future research. △ Less

Submitted 5 March, 2021; v1 submitted 3 March, 2021; originally announced March 2021.

arXiv:2103.02138 [pdf, ps, other]

Parametric Complexity Bounds for Approximating PDEs with Neural Networks

Authors: Tanya Marwah, Zachary C. Lipton, Andrej Risteski

Abstract: Recent experiments have shown that deep networks can approximate solutions to high-dimensional PDEs, seemingly esca** the curse of dimensionality. However, questions regarding the theoretical basis for such approximations, including the required network size, remain open. In this paper, we investigate the representational power of neural networks for approximating solutions to linear elliptic PD… ▽ More Recent experiments have shown that deep networks can approximate solutions to high-dimensional PDEs, seemingly esca** the curse of dimensionality. However, questions regarding the theoretical basis for such approximations, including the required network size, remain open. In this paper, we investigate the representational power of neural networks for approximating solutions to linear elliptic PDEs with Dirichlet boundary conditions. We prove that when a PDE's coefficients are representable by small neural networks, the parameters required to approximate its solution scale polynomially with the input dimension $d$ and proportionally to the parameter counts of the coefficient networks. To this we end, we develop a proof technique that simulates gradient descent (in an appropriate Hilbert space) by growing a neural network architecture whose iterates each participate as sub-networks in their (slightly larger) successors, and converge to the solution of the PDE. We bound the size of the solution, showing a polynomial dependence on $d$ and no dependence on the volume of the domain. △ Less

Submitted 6 July, 2021; v1 submitted 2 March, 2021; originally announced March 2021.

arXiv:2102.10264 [pdf, other]

On Proximal Policy Optimization's Heavy-tailed Gradients

Authors: Saurabh Garg, Joshua Zhanson, Emilio Parisotto, Adarsh Prasad, J. Zico Kolter, Zachary C. Lipton, Sivaraman Balakrishnan, Ruslan Salakhutdinov, Pradeep Ravikumar

Abstract: Modern policy gradient algorithms such as Proximal Policy Optimization (PPO) rely on an arsenal of heuristics, including loss clip** and gradient clip**, to ensure successful learning. These heuristics are reminiscent of techniques from robust statistics, commonly used for estimation in outlier-rich (``heavy-tailed'') regimes. In this paper, we present a detailed empirical study to characteriz… ▽ More Modern policy gradient algorithms such as Proximal Policy Optimization (PPO) rely on an arsenal of heuristics, including loss clip** and gradient clip**, to ensure successful learning. These heuristics are reminiscent of techniques from robust statistics, commonly used for estimation in outlier-rich (``heavy-tailed'') regimes. In this paper, we present a detailed empirical study to characterize the heavy-tailed nature of the gradients of the PPO surrogate reward function. We demonstrate that the gradients, especially for the actor network, exhibit pronounced heavy-tailedness and that it increases as the agent's policy diverges from the behavioral policy (i.e., as the agent goes further off policy). Further examination implicates the likelihood ratios and advantages in the surrogate reward as the main sources of the observed heavy-tailedness. We then highlight issues arising due to the heavy-tailed nature of the gradients. In this light, we study the effects of the standard PPO clip** heuristics, demonstrating that these tricks primarily serve to offset heavy-tailedness in gradients. Thus motivated, we propose incorporating GMOM, a high-dimensional robust estimator, into PPO as a substitute for three clip** tricks. Despite requiring less hyperparameter tuning, our method matches the performance of PPO (with all heuristics enabled) on a battery of MuJoCo continuous control tasks. △ Less

Submitted 12 July, 2021; v1 submitted 20 February, 2021; originally announced February 2021.

Comments: ICML 2021

arXiv:2012.04825 [pdf, other]

Unpacking the Drop in COVID-19 Case Fatality Rates: A Study of National and Florida Line-Level Data

Authors: Cheng Cheng, Helen Zhou, Jeremy C. Weiss, Zachary C. Lipton

Abstract: Since the COVID-19 pandemic first reached the United States, the case fatality rate has fallen precipitously. Several possible explanations have been floated, including greater detection of mild cases due to expanded testing, shifts in age distribution among the infected, lags between confirmed cases and reported deaths, improvements in treatment, mutations in the virus, and decreased viral load a… ▽ More Since the COVID-19 pandemic first reached the United States, the case fatality rate has fallen precipitously. Several possible explanations have been floated, including greater detection of mild cases due to expanded testing, shifts in age distribution among the infected, lags between confirmed cases and reported deaths, improvements in treatment, mutations in the virus, and decreased viral load as a result of mask-wearing. Using both Florida line-level data and recently released (but incomplete) national line level data from April 1, 2020 to November 1, 2020 on cases, hospitalizations, and deaths--each stratified by age--we unpack the drop in case fatality rate (CFR). Under the hypothesis that improvements in treatment efficacy should correspond to decreases in hospitalization fatality rate (HFR), we find that improvements in the national data do not always match the story told by Florida data. In the national data, treatment improvements between the first wave and the second wave appear substantial, but modest when compared to the drop in aggregate CFR. By contrast, possibly due to constrained resources in a much larger second peak, Florida data suggests comparatively little difference between the first and second wave, with HFR slightly increasing in every age group. However, by November 1st, both Florida and national data suggest significant decreases in age-stratified HFR since April 1st. By accounting for several confounding factors, our analysis shows how age-stratified HFR can provide a more realistic picture of treatment improvements than CFR. One key limitation of our analysis is that the national line-level data remains incomplete and plagued by artifacts. Our analysis highlights the crucial role that this data can play but also the pressing need for public, complete, and high-quality age-stratified line-level data for both cases, hospitalizations, and deaths for all states. △ Less

Submitted 11 December, 2020; v1 submitted 8 December, 2020; originally announced December 2020.

Comments: 24 pages, 13 figures

arXiv:2012.00893 [pdf, other]

Evaluating Explanations: How much do explanations from the teacher aid students?

Authors: Danish Pruthi, Rachit Bansal, Bhuwan Dhingra, Livio Baldini Soares, Michael Collins, Zachary C. Lipton, Graham Neubig, William W. Cohen

Abstract: While many methods purport to explain predictions by highlighting salient features, what aims these explanations serve and how they ought to be evaluated often go unstated. In this work, we introduce a framework to quantify the value of explanations via the accuracy gains that they confer on a student model trained to simulate a teacher model. Crucially, the explanations are available to the stude… ▽ More While many methods purport to explain predictions by highlighting salient features, what aims these explanations serve and how they ought to be evaluated often go unstated. In this work, we introduce a framework to quantify the value of explanations via the accuracy gains that they confer on a student model trained to simulate a teacher model. Crucially, the explanations are available to the student during training, but are not available at test time. Compared to prior proposals, our approach is less easily gamed, enabling principled, automatic, model-agnostic evaluation of attributions. Using our framework, we compare numerous attribution methods for text classification and question answering, and observe quantitative differences that are consistent (to a moderate to high degree) across different student model architectures and learning strategies. △ Less

Submitted 16 December, 2021; v1 submitted 1 December, 2020; originally announced December 2020.

Comments: TACL 2021 (pre-MIT Press publication version)

arXiv:2011.13477 [pdf, other]

Decoding and Diversity in Machine Translation

Authors: Nicholas Roberts, Davis Liang, Graham Neubig, Zachary C. Lipton

Abstract: Neural Machine Translation (NMT) systems are typically evaluated using automated metrics that assess the agreement between generated translations and ground truth candidates. To improve systems with respect to these metrics, NLP researchers employ a variety of heuristic techniques, including searching for the conditional mode (vs. sampling) and incorporating various training heuristics (e.g., labe… ▽ More Neural Machine Translation (NMT) systems are typically evaluated using automated metrics that assess the agreement between generated translations and ground truth candidates. To improve systems with respect to these metrics, NLP researchers employ a variety of heuristic techniques, including searching for the conditional mode (vs. sampling) and incorporating various training heuristics (e.g., label smoothing). While search strategies significantly improve BLEU score, they yield deterministic outputs that lack the diversity of human translations. Moreover, search tends to bias the distribution of translated gender pronouns. This makes human-level BLEU a misleading benchmark in that modern MT systems cannot approach human-level BLEU while simultaneously maintaining human-level translation diversity. In this paper, we characterize distributional differences between generated and real translations, examining the cost in diversity paid for the BLEU scores enjoyed by NMT. Moreover, our study implicates search as a salient source of known bias when translating gender pronouns. △ Less

Submitted 26 November, 2020; originally announced November 2020.

Comments: Presented at the Resistance AI Workshop, 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada

arXiv:2011.06741 [pdf, other]

Rebounding Bandits for Modeling Satiation Effects

Authors: Liu Leqi, Fatma Kilinc-Karzan, Zachary C. Lipton, Alan L. Montgomery

Abstract: Psychological research shows that enjoyment of many goods is subject to satiation, with short-term satisfaction declining after repeated exposures to the same item. Nevertheless, proposed algorithms for powering recommender systems seldom model these dynamics, instead proceeding as though user preferences were fixed in time. In this work, we introduce rebounding bandits, a multi-armed bandit setup… ▽ More Psychological research shows that enjoyment of many goods is subject to satiation, with short-term satisfaction declining after repeated exposures to the same item. Nevertheless, proposed algorithms for powering recommender systems seldom model these dynamics, instead proceeding as though user preferences were fixed in time. In this work, we introduce rebounding bandits, a multi-armed bandit setup, where satiation dynamics are modeled as time-invariant linear dynamical systems. Expected rewards for each arm decline monotonically with consecutive exposures to it and rebound towards the initial reward whenever that arm is not pulled. Unlike classical bandit settings, methods for tackling rebounding bandits must plan ahead and model-based methods rely on estimating the parameters of the satiation dynamics. We characterize the planning problem, showing that the greedy policy is optimal when the arms exhibit identical deterministic dynamics. To address stochastic satiation dynamics with unknown parameters, we propose Explore-Estimate-Plan (EEP), an algorithm that pulls arms methodically, estimates the system dynamics, and then plans accordingly. △ Less

Submitted 27 October, 2021; v1 submitted 12 November, 2020; originally announced November 2020.

arXiv:2011.03654 [pdf, other]

doi 10.1145/3461702.3462521

Fair Machine Learning Under Partial Compliance

Authors: Jessica Dai, Sina Fazelpour, Zachary C. Lipton

Abstract: Typically, fair machine learning research focuses on a single decisionmaker and assumes that the underlying population is stationary. However, many of the critical domains motivating this work are characterized by competitive marketplaces with many decisionmakers. Realistically, we might expect only a subset of them to adopt any non-compulsory fairness-conscious policy, a situation that political… ▽ More Typically, fair machine learning research focuses on a single decisionmaker and assumes that the underlying population is stationary. However, many of the critical domains motivating this work are characterized by competitive marketplaces with many decisionmakers. Realistically, we might expect only a subset of them to adopt any non-compulsory fairness-conscious policy, a situation that political philosophers call partial compliance. This possibility raises important questions: how does the strategic behavior of decision subjects in partial compliance settings affect the allocation outcomes? If k% of employers were to voluntarily adopt a fairness-promoting intervention, should we expect k% progress (in aggregate) towards the benefits of universal adoption, or will the dynamics of partial compliance wash out the hoped-for benefits? How might adopting a global (versus local) perspective impact the conclusions of an auditor? In this paper, we propose a simple model of an employment market, leveraging simulation as a tool to explore the impact of both interaction effects and incentive effects on outcomes and auditing metrics. Our key findings are that at equilibrium: (1) partial compliance (k% of employers) can result in far less than proportional (k%) progress towards the full compliance outcomes; (2) the gap is more severe when fair employers match global (vs local) statistics; (3) choices of local vs global statistics can paint dramatically different pictures of the performance vis-a-vis fairness desiderata of compliant versus non-compliant employers; and (4) partial compliance to local parity measures can induce extreme segregation. △ Less

Submitted 26 September, 2022; v1 submitted 6 November, 2020; originally announced November 2020.

Comments: Presented at AIES 2021; previously at the NeurIPS 2020 Workshop on Consequential Decision Making in Dynamic Environments and the NeurIPS 2020 Workshop on ML for Economic Policy. Minor correction uploaded Sept. 2022

arXiv:2011.01459 [pdf, other]

Weakly- and Semi-supervised Evidence Extraction

Authors: Danish Pruthi, Bhuwan Dhingra, Graham Neubig, Zachary C. Lipton

Abstract: For many prediction tasks, stakeholders desire not only predictions but also supporting evidence that a human can use to verify its correctness. However, in practice, additional annotations marking supporting evidence may only be available for a minority of training examples (if available at all). In this paper, we propose new methods to combine few evidence annotations (strong semi-supervision) w… ▽ More For many prediction tasks, stakeholders desire not only predictions but also supporting evidence that a human can use to verify its correctness. However, in practice, additional annotations marking supporting evidence may only be available for a minority of training examples (if available at all). In this paper, we propose new methods to combine few evidence annotations (strong semi-supervision) with abundant document-level labels (weak supervision) for the task of evidence extraction. Evaluating on two classification tasks that feature evidence annotations, we find that our methods outperform baselines adapted from the interpretability literature to our task. Our approach yields substantial gains with as few as hundred evidence annotations. Code and datasets to reproduce our work are available at https://github.com/danishpruthi/evidence-extraction. △ Less

Submitted 2 November, 2020; originally announced November 2020.

Comments: Accepted to the Findings of EMNLP 2020, to be presented at BlackBoxNLP

arXiv:2010.11966 [pdf, other]

Unsupervised Data Augmentation with Naive Augmentation and without Unlabeled Data

Authors: David Lowell, Brian E. Howard, Zachary C. Lipton, Byron C. Wallace

Abstract: Unsupervised Data Augmentation (UDA) is a semi-supervised technique that applies a consistency loss to penalize differences between a model's predictions on (a) observed (unlabeled) examples; and (b) corresponding 'noised' examples produced via data augmentation. While UDA has gained popularity for text classification, open questions linger over which design decisions are necessary and over how to… ▽ More Unsupervised Data Augmentation (UDA) is a semi-supervised technique that applies a consistency loss to penalize differences between a model's predictions on (a) observed (unlabeled) examples; and (b) corresponding 'noised' examples produced via data augmentation. While UDA has gained popularity for text classification, open questions linger over which design decisions are necessary and over how to extend the method to sequence labeling tasks. This method has recently gained traction for text classification. In this paper, we re-examine UDA and demonstrate its efficacy on several sequential tasks. Our main contribution is an empirical study of UDA to establish which components of the algorithm confer benefits in NLP. Notably, although prior work has emphasized the use of clever augmentation techniques including back-translation, we find that enforcing consistency between predictions assigned to observed and randomly substituted words often yields comparable (or greater) benefits compared to these complex perturbation models. Furthermore, we find that applying its consistency loss affords meaningful gains without any unlabeled data at all, i.e., in a standard supervised setting. In short: UDA need not be unsupervised, and does not require complex data augmentation to be effective. △ Less

Submitted 22 October, 2020; originally announced October 2020.

arXiv:2010.03017 [pdf, other]

On Negative Interference in Multilingual Models: Findings and A Meta-Learning Treatment

Authors: Zirui Wang, Zachary C. Lipton, Yulia Tsvetkov

Abstract: Modern multilingual models are trained on concatenated text from multiple languages in hopes of conferring benefits to each (positive transfer), with the most pronounced benefits accruing to low-resource languages. However, recent work has shown that this approach can degrade performance on high-resource languages, a phenomenon known as negative interference. In this paper, we present the first sy… ▽ More Modern multilingual models are trained on concatenated text from multiple languages in hopes of conferring benefits to each (positive transfer), with the most pronounced benefits accruing to low-resource languages. However, recent work has shown that this approach can degrade performance on high-resource languages, a phenomenon known as negative interference. In this paper, we present the first systematic study of negative interference. We show that, contrary to previous belief, negative interference also impacts low-resource languages. While parameters are maximally shared to learn language-universal structures, we demonstrate that language-specific parameters do exist in multilingual models and they are a potential cause of negative interference. Motivated by these observations, we also present a meta-learning algorithm that obtains better cross-lingual transferability and alleviates negative interference, by adding language-specific layers as meta-parameters and training them in a manner that explicitly improves shared layers' generalization on all languages. Overall, our results show that negative interference is more common than previously known, suggesting new directions for improving multilingual representations. △ Less

Submitted 6 October, 2020; originally announced October 2020.

Comments: Published as a main conference paper at EMNLP 2020

arXiv:2010.02114 [pdf, other]

Explaining The Efficacy of Counterfactually Augmented Data

Authors: Divyansh Kaushik, Amrith Setlur, Eduard Hovy, Zachary C. Lipton

Abstract: In attempts to produce ML models less reliant on spurious patterns in NLP datasets, researchers have recently proposed curating counterfactually augmented data (CAD) via a human-in-the-loop process in which given some documents and their (initial) labels, humans must revise the text to make a counterfactual label applicable. Importantly, edits that are not necessary to flip the applicable label ar… ▽ More In attempts to produce ML models less reliant on spurious patterns in NLP datasets, researchers have recently proposed curating counterfactually augmented data (CAD) via a human-in-the-loop process in which given some documents and their (initial) labels, humans must revise the text to make a counterfactual label applicable. Importantly, edits that are not necessary to flip the applicable label are prohibited. Models trained on the augmented data appear, empirically, to rely less on semantically irrelevant words and to generalize better out of domain. While this work draws loosely on causal thinking, the underlying causal model (even at an abstract level) and the principles underlying the observed out-of-domain improvements remain unclear. In this paper, we introduce a toy analog based on linear Gaussian models, observing interesting relationships between causal models, measurement noise, out-of-domain generalization, and reliance on spurious signals. Our analysis provides some insights that help to explain the efficacy of CAD. Moreover, we develop the hypothesis that while adding noise to causal features should degrade both in-domain and out-of-domain performance, adding noise to non-causal features should lead to relative improvements in out-of-domain performance. This idea inspires a speculative test for determining whether a feature attribution technique has identified the causal spans. If adding noise (e.g., by random word flips) to the highlighted spans degrades both in-domain and out-of-domain performance on a battery of challenge datasets, but adding noise to the complement gives improvements out-of-domain, it suggests we have identified causal spans. We present a large-scale empirical study comparing spans edited to create CAD to those selected by attention and saliency maps. Across numerous domains and models, we find that the hypothesized phenomenon is pronounced for CAD. △ Less

Submitted 23 March, 2021; v1 submitted 5 October, 2020; originally announced October 2020.

Comments: Published at ICLR 2021

arXiv:2007.07151 [pdf, other]

Extracting Structured Data from Physician-Patient Conversations By Predicting Noteworthy Utterances

Authors: Kundan Krishna, Amy Pavel, Benjamin Schloss, Jeffrey P. Bigham, Zachary C. Lipton

Abstract: Despite diverse efforts to mine various modalities of medical data, the conversations between physicians and patients at the time of care remain an untapped source of insights. In this paper, we leverage this data to extract structured information that might assist physicians with post-visit documentation in electronic health records, potentially lightening the clerical burden. In this exploratory… ▽ More Despite diverse efforts to mine various modalities of medical data, the conversations between physicians and patients at the time of care remain an untapped source of insights. In this paper, we leverage this data to extract structured information that might assist physicians with post-visit documentation in electronic health records, potentially lightening the clerical burden. In this exploratory study, we describe a new dataset consisting of conversation transcripts, post-visit summaries, corresponding supporting evidence (in the transcript), and structured labels. We focus on the tasks of recognizing relevant diagnoses and abnormalities in the review of organ systems (RoS). One methodological challenge is that the conversations are long (around 1500 words), making it difficult for modern deep-learning models to use them as input. To address this challenge, we extract noteworthy utterances---parts of the conversation likely to be cited as evidence supporting some summary sentence. We find that by first filtering for (predicted) noteworthy utterances, we can significantly boost predictive performance for recognizing both diagnoses and RoS abnormalities. △ Less

Submitted 14 July, 2020; originally announced July 2020.

arXiv:2007.04082 [pdf, other]

Uncertainty-Aware Lookahead Factor Models for Quantitative Investing

Authors: Lakshay Chauhan, John Alberg, Zachary C. Lipton

Abstract: On a periodic basis, publicly traded companies report fundamentals, financial data including revenue, earnings, debt, among others. Quantitative finance research has identified several factors, functions of the reported data that historically correlate with stock market performance. In this paper, we first show through simulation that if we could select stocks via factors calculated on future fund… ▽ More On a periodic basis, publicly traded companies report fundamentals, financial data including revenue, earnings, debt, among others. Quantitative finance research has identified several factors, functions of the reported data that historically correlate with stock market performance. In this paper, we first show through simulation that if we could select stocks via factors calculated on future fundamentals (via oracle), that our portfolios would far outperform standard factor models. Motivated by this insight, we train deep nets to forecast future fundamentals from a trailing 5-year history. We propose lookahead factor models which plug these predicted future fundamentals into traditional factors. Finally, we incorporate uncertainty estimates from both neural heteroscedastic regression and a dropout-based heuristic, improving performance by adjusting our portfolios to avert risk. In retrospective analysis, we leverage an industry-grade portfolio simulator (backtester) to show simultaneous improvement in annualized return and Sharpe ratio. Specifically, the simulated annualized return for the uncertainty-aware model is 17.7% (vs 14.0% for a standard factor model) and the Sharpe ratio is 0.84 (vs 0.52). △ Less

Submitted 15 July, 2020; v1 submitted 6 July, 2020; originally announced July 2020.

arXiv:2006.01898 [pdf, other]

Predicting Mortality Risk in Viral and Unspecified Pneumonia to Assist Clinicians with COVID-19 ECMO Planning

Authors: Helen Zhou, Cheng Cheng, Zachary C. Lipton, George H. Chen, Jeremy C. Weiss

Abstract: Respiratory complications due to coronavirus disease COVID-19 have claimed tens of thousands of lives in 2020. Many cases of COVID-19 escalate from Severe Acute Respiratory Syndrome (SARS-CoV-2) to viral pneumonia to acute respiratory distress syndrome (ARDS) to death. Extracorporeal membranous oxygenation (ECMO) is a life-sustaining oxygenation and ventilation therapy that may be used for patient… ▽ More Respiratory complications due to coronavirus disease COVID-19 have claimed tens of thousands of lives in 2020. Many cases of COVID-19 escalate from Severe Acute Respiratory Syndrome (SARS-CoV-2) to viral pneumonia to acute respiratory distress syndrome (ARDS) to death. Extracorporeal membranous oxygenation (ECMO) is a life-sustaining oxygenation and ventilation therapy that may be used for patients with severe ARDS when mechanical ventilation is insufficient to sustain life. While early planning and surgical cannulation for ECMO can increase survival, clinicians report the lack of a risk score hinders these efforts. In this work, we leverage machine learning techniques to develop the PEER score, used to highlight critically ill patients with viral or unspecified pneumonia at high risk of mortality or decompensation in a subpopulation eligible for ECMO. The PEER score is validated on two large, publicly available critical care databases and predicts mortality at least as well as other existing risk scores. Stratifying our cohorts into low-risk and high-risk groups, we find that the high-risk group also has a higher proportion of decompensation indicators such as vasopressor and ventilator use. Finally, the PEER score is provided in the form of a nomogram for direct calculation of patient risk, and can be used to highlight at-risk patients among critical care patients eligible for ECMO. △ Less

Submitted 2 June, 2020; originally announced June 2020.

arXiv:2005.01795 [pdf, other]

Generating SOAP Notes from Doctor-Patient Conversations Using Modular Summarization Techniques

Authors: Kundan Krishna, Sopan Khosla, Jeffrey P. Bigham, Zachary C. Lipton

Abstract: Following each patient visit, physicians draft long semi-structured clinical summaries called SOAP notes. While invaluable to clinicians and researchers, creating digital SOAP notes is burdensome, contributing to physician burnout. In this paper, we introduce the first complete pipelines to leverage deep summarization models to generate these notes based on transcripts of conversations between phy… ▽ More Following each patient visit, physicians draft long semi-structured clinical summaries called SOAP notes. While invaluable to clinicians and researchers, creating digital SOAP notes is burdensome, contributing to physician burnout. In this paper, we introduce the first complete pipelines to leverage deep summarization models to generate these notes based on transcripts of conversations between physicians and patients. After exploring a spectrum of methods across the extractive-abstractive spectrum, we propose Cluster2Sent, an algorithm that (i) extracts important utterances relevant to each summary section; (ii) clusters together related utterances; and then (iii) generates one summary sentence per cluster. Cluster2Sent outperforms its purely abstractive counterpart by 8 ROUGE-1 points, and produces significantly more factual and coherent sentences as assessed by expert human evaluators. For reproducibility, we demonstrate similar benefits on the publicly available AMI dataset. Our results speak to the benefits of structuring summaries into sections and annotating supporting evidence when constructing summarization corpora. △ Less

Submitted 2 June, 2021; v1 submitted 4 May, 2020; originally announced May 2020.

Comments: Published at ACL 2021 Main Conference

arXiv:2003.11991 [pdf, ps, other]

Estimating Treatment Effects with Observed Confounders and Mediators

Authors: Shantanu Gupta, Zachary C. Lipton, David Childers

Abstract: Given a causal graph, the do-calculus can express treatment effects as functionals of the observational joint distribution that can be estimated empirically. Sometimes the do-calculus identifies multiple valid formulae, prompting us to compare the statistical properties of the corresponding estimators. For example, the backdoor formula applies when all confounders are observed and the frontdoor fo… ▽ More Given a causal graph, the do-calculus can express treatment effects as functionals of the observational joint distribution that can be estimated empirically. Sometimes the do-calculus identifies multiple valid formulae, prompting us to compare the statistical properties of the corresponding estimators. For example, the backdoor formula applies when all confounders are observed and the frontdoor formula applies when an observed mediator transmits the causal effect. In this paper, we investigate the over-identified scenario where both confounders and mediators are observed, rendering both estimators valid. Addressing the linear Gaussian causal model, we demonstrate that either estimator can dominate the other by an unbounded constant factor. Next, we derive an optimal estimator, which leverages all observed variables, and bound its finite-sample variance. We show that it strictly outperforms the backdoor and frontdoor estimators and that this improvement can be unbounded. We also present a procedure for combining two datasets, one with observed confounders and another with observed mediators. Finally, we evaluate our methods on both simulated data and the IHDP and JTPA datasets. △ Less

Submitted 14 June, 2021; v1 submitted 26 March, 2020; originally announced March 2020.

arXiv:2003.07554 [pdf, other]

A Unified View of Label Shift Estimation

Authors: Saurabh Garg, Yifan Wu, Sivaraman Balakrishnan, Zachary C. Lipton

Abstract: Under label shift, the label distribution p(y) might change but the class-conditional distributions p(x|y) do not. There are two dominant approaches for estimating the label marginal. BBSE, a moment-matching approach based on confusion matrices, is provably consistent and provides interpretable error bounds. However, a maximum likelihood estimation approach, which we call MLLS, dominates empirical… ▽ More Under label shift, the label distribution p(y) might change but the class-conditional distributions p(x|y) do not. There are two dominant approaches for estimating the label marginal. BBSE, a moment-matching approach based on confusion matrices, is provably consistent and provides interpretable error bounds. However, a maximum likelihood estimation approach, which we call MLLS, dominates empirically. In this paper, we present a unified view of the two methods and the first theoretical characterization of MLLS. Our contributions include (i) consistency conditions for MLLS, which include calibration of the classifier and a confusion matrix invertibility condition that BBSE also requires; (ii) a unified framework, casting BBSE as roughly equivalent to MLLS for a particular choice of calibration method; and (iii) a decomposition of MLLS's finite-sample error into terms reflecting miscalibration and estimation error. Our analysis attributes BBSE's statistical inefficiency to a loss of information due to coarse calibration. Experiments on synthetic data, MNIST, and CIFAR10 support our findings. △ Less

Submitted 16 October, 2020; v1 submitted 17 March, 2020; originally announced March 2020.

Comments: Accepted at Neurips 2020

arXiv:2002.11096 [pdf, other]

Causal Inference With Selectively Deconfounded Data

Authors: Kyra Gan, Andrew A. Li, Zachary C. Lipton, Sridhar Tayur

Abstract: Given only data generated by a standard confounding graph with unobserved confounder, the Average Treatment Effect (ATE) is not identifiable. To estimate the ATE, a practitioner must then either (a) collect deconfounded data;(b) run a clinical trial; or (c) elucidate further properties of the causal graph that might render the ATE identifiable. In this paper, we consider the benefit of incorporati… ▽ More Given only data generated by a standard confounding graph with unobserved confounder, the Average Treatment Effect (ATE) is not identifiable. To estimate the ATE, a practitioner must then either (a) collect deconfounded data;(b) run a clinical trial; or (c) elucidate further properties of the causal graph that might render the ATE identifiable. In this paper, we consider the benefit of incorporating a large confounded observational dataset (confounder unobserved) alongside a small deconfounded observational dataset (confounder revealed) when estimating the ATE. Our theoretical results suggest that the inclusion of confounded data can significantly reduce the quantity of deconfounded data required to estimate the ATE to within a desired accuracy level. Moreover, in some cases -- say, genetics -- we could imagine retrospectively selecting samples to deconfound. We demonstrate that by actively selecting these samples based upon the (already observed) treatment and outcome, we can reduce sample complexity further. Our theoretical and empirical results establish that the worst-case relative performance of our approach (vs. a natural benchmark) is bounded while our best-case gains are unbounded. Finally, we demonstrate the benefits of selective deconfounding using a large real-world dataset related to genetic mutation in cancer. △ Less

Submitted 6 March, 2021; v1 submitted 25 February, 2020; originally announced February 2020.

arXiv:2002.10021 [pdf, other]

How Transferable are the Representations Learned by Deep Q Agents?

Authors: Jacob Tyo, Zachary Lipton

Abstract: In this paper, we consider the source of Deep Reinforcement Learning (DRL)'s sample complexity, asking how much derives from the requirement of learning useful representations of environment states and how much is due to the sample complexity of learning a policy. While for DRL agents, the distinction between representation and policy may not be clear, we seek new insight through a set of transfer… ▽ More In this paper, we consider the source of Deep Reinforcement Learning (DRL)'s sample complexity, asking how much derives from the requirement of learning useful representations of environment states and how much is due to the sample complexity of learning a policy. While for DRL agents, the distinction between representation and policy may not be clear, we seek new insight through a set of transfer learning experiments. In each experiment, we retain some fraction of layers trained on either the same game or a related game, comparing the benefits of transfer learning to learning a policy from scratch. Interestingly, we find that benefits due to transfer are highly variable in general and non-symmetric across pairs of tasks. Our experiments suggest that perhaps transfer from simpler environments can boost performance on more complex downstream tasks and that the requirements of learning a useful representation can range from negligible to the majority of the sample complexity, based on the environment. Furthermore, we find that fine-tuning generally outperforms training with the transferred layers frozen, confirming an insight first noted in the classification setting. △ Less

Submitted 23 February, 2020; originally announced February 2020.

arXiv:2001.09773 [pdf, ps, other]

Algorithmic Fairness from a Non-ideal Perspective

Authors: Sina Fazelpour, Zachary C. Lipton

Abstract: Inspired by recent breakthroughs in predictive modeling, practitioners in both industry and government have turned to machine learning with hopes of operationalizing predictions to drive automated decisions. Unfortunately, many social desiderata concerning consequential decisions, such as justice or fairness, have no natural formulation within a purely predictive framework. In efforts to mitigate… ▽ More Inspired by recent breakthroughs in predictive modeling, practitioners in both industry and government have turned to machine learning with hopes of operationalizing predictions to drive automated decisions. Unfortunately, many social desiderata concerning consequential decisions, such as justice or fairness, have no natural formulation within a purely predictive framework. In efforts to mitigate these problems, researchers have proposed a variety of metrics for quantifying deviations from various statistical parities that we might expect to observe in a fair world and offered a variety of algorithms in attempts to satisfy subsets of these parities or to trade off the degree to which they are satisfied against utility. In this paper, we connect this approach to \emph{fair machine learning} to the literature on ideal and non-ideal methodological approaches in political philosophy. The ideal approach requires positing the principles according to which a just world would operate. In the most straightforward application of ideal theory, one supports a proposed policy by arguing that it closes a discrepancy between the real and the perfectly just world. However, by failing to account for the mechanisms by which our non-ideal world arose, the responsibilities of various decision-makers, and the impacts of proposed policies, naive applications of ideal thinking can lead to misguided interventions. In this paper, we demonstrate a connection between the fair machine learning literature and the ideal approach in political philosophy, and argue that the increasingly apparent shortcomings of proposed fair machine learning algorithms reflect broader troubles faced by the ideal approach. We conclude with a critical discussion of the harms of misguided solutions, a reinterpretation of impossibility results, and directions for future research. △ Less

Submitted 8 January, 2020; originally announced January 2020.

Comments: Accepted for publication at the AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society (AIES) 2020

arXiv:1912.06074 [pdf, other]

Game Design for Eliciting Distinguishable Behavior

Authors: Fan Yang, Liu Leqi, Yifan Wu, Zachary C. Lipton, Pradeep Ravikumar, William W. Cohen, Tom Mitchell

Abstract: The ability to inferring latent psychological traits from human behavior is key to develo** personalized human-interacting machine learning systems. Approaches to infer such traits range from surveys to manually-constructed experiments and games. However, these traditional games are limited because they are typically designed based on heuristics. In this paper, we formulate the task of designing… ▽ More The ability to inferring latent psychological traits from human behavior is key to develo** personalized human-interacting machine learning systems. Approaches to infer such traits range from surveys to manually-constructed experiments and games. However, these traditional games are limited because they are typically designed based on heuristics. In this paper, we formulate the task of designing \emph{behavior diagnostic games} that elicit distinguishable behavior as a mutual information maximization problem, which can be solved by optimizing a variational lower bound. Our framework is instantiated by using prospect theory to model varying player traits, and Markov Decision Processes to parameterize the games. We validate our approach empirically, showing that our designed games can successfully distinguish among players with different traits, outperforming manually-designed ones by a large margin. △ Less

Submitted 12 December, 2019; originally announced December 2019.

Comments: 33rd Conference on Neural Information Processing Systems (NeurIPS 2019)

arXiv:1910.08640 [pdf, other]

Are Perceptually-Aligned Gradients a General Property of Robust Classifiers?

Authors: Simran Kaur, Jeremy Cohen, Zachary C. Lipton

Abstract: For a standard convolutional neural network, optimizing over the input pixels to maximize the score of some target class will generally produce a grainy-looking version of the original image. However, Santurkar et al. (2019) demonstrated that for adversarially-trained neural networks, this optimization produces images that uncannily resemble the target class. In this paper, we show that these "per… ▽ More For a standard convolutional neural network, optimizing over the input pixels to maximize the score of some target class will generally produce a grainy-looking version of the original image. However, Santurkar et al. (2019) demonstrated that for adversarially-trained neural networks, this optimization produces images that uncannily resemble the target class. In this paper, we show that these "perceptually-aligned gradients" also occur under randomized smoothing, an alternative means of constructing adversarially-robust classifiers. Our finding supports the hypothesis that perceptually-aligned gradients may be a general property of robust classifiers. We hope that our results will inspire research aimed at explaining this link between perceptually-aligned gradients and adversarial robustness. △ Less

Submitted 23 October, 2019; v1 submitted 18 October, 2019; originally announced October 2019.

Comments: To appear in the "Science Meets Engineering of Deep Learning" Workshop at NeurIPS 2019

arXiv:1910.00762 [pdf, other]

Accelerating Deep Learning by Focusing on the Biggest Losers

Authors: Angela H. Jiang, Daniel L. -K. Wong, Giulio Zhou, David G. Andersen, Jeffrey Dean, Gregory R. Ganger, Gauri Joshi, Michael Kaminksy, Michael Kozuch, Zachary C. Lipton, Padmanabhan Pillai

Abstract: This paper introduces Selective-Backprop, a technique that accelerates the training of deep neural networks (DNNs) by prioritizing examples with high loss at each iteration. Selective-Backprop uses the output of a training example's forward pass to decide whether to use that example to compute gradients and update parameters, or to skip immediately to the next example. By reducing the number of co… ▽ More This paper introduces Selective-Backprop, a technique that accelerates the training of deep neural networks (DNNs) by prioritizing examples with high loss at each iteration. Selective-Backprop uses the output of a training example's forward pass to decide whether to use that example to compute gradients and update parameters, or to skip immediately to the next example. By reducing the number of computationally-expensive backpropagation steps performed, Selective-Backprop accelerates training. Evaluation on CIFAR10, CIFAR100, and SVHN, across a variety of modern image models, shows that Selective-Backprop converges to target error rates up to 3.5x faster than with standard SGD and between 1.02--1.8x faster than a state-of-the-art importance sampling approach. Further acceleration of 26% can be achieved by using stale forward pass results for selection, thus also skip** forward passes of low priority examples. △ Less

Submitted 1 October, 2019; originally announced October 2019.

arXiv:1909.12434 [pdf, other]

Learning the Difference that Makes a Difference with Counterfactually-Augmented Data

Authors: Divyansh Kaushik, Eduard Hovy, Zachary C. Lipton

Abstract: Despite alarm over the reliance of machine learning systems on so-called spurious patterns, the term lacks coherent meaning in standard statistical frameworks. However, the language of causality offers clarity: spurious associations are due to confounding (e.g., a common cause), but not direct or indirect causal effects. In this paper, we focus on natural language processing, introducing methods a… ▽ More Despite alarm over the reliance of machine learning systems on so-called spurious patterns, the term lacks coherent meaning in standard statistical frameworks. However, the language of causality offers clarity: spurious associations are due to confounding (e.g., a common cause), but not direct or indirect causal effects. In this paper, we focus on natural language processing, introducing methods and resources for training models less sensitive to spurious patterns. Given documents and their initial labels, we task humans with revising each document so that it (i) accords with a counterfactual target label; (ii) retains internal coherence; and (iii) avoids unnecessary changes. Interestingly, on sentiment analysis and natural language inference tasks, classifiers trained on original data fail on their counterfactually-revised counterparts and vice versa. Classifiers trained on combined datasets perform remarkably well, just shy of those specialized to either domain. While classifiers trained on either original or manipulated data alone are sensitive to spurious features (e.g., mentions of genre), models trained on the combined data are less sensitive to this signal. Both datasets are publicly available. △ Less

Submitted 14 February, 2020; v1 submitted 26 September, 2019; originally announced September 2019.

Comments: Published at ICLR 2020

arXiv:1909.07913 [pdf, other]

Learning to Deceive with Attention-Based Explanations

Authors: Danish Pruthi, Mansi Gupta, Bhuwan Dhingra, Graham Neubig, Zachary C. Lipton

Abstract: Attention mechanisms are ubiquitous components in neural architectures applied to natural language processing. In addition to yielding gains in predictive accuracy, attention weights are often claimed to confer interpretability, purportedly useful both for providing insights to practitioners and for explaining why a model makes its decisions to stakeholders. We call the latter use of attention mec… ▽ More Attention mechanisms are ubiquitous components in neural architectures applied to natural language processing. In addition to yielding gains in predictive accuracy, attention weights are often claimed to confer interpretability, purportedly useful both for providing insights to practitioners and for explaining why a model makes its decisions to stakeholders. We call the latter use of attention mechanisms into question by demonstrating a simple method for training models to produce deceptive attention masks. Our method diminishes the total weight assigned to designated impermissible tokens, even when the models can be shown to nevertheless rely on these features to drive predictions. Across multiple models and tasks, our approach manipulates attention weights while paying surprisingly little cost in accuracy. Through a human study, we show that our manipulated attention-based explanations deceive people into thinking that predictions from a model biased against gender minorities do not rely on the gender. Consequently, our results cast doubt on attention's reliability as a tool for auditing algorithms in the context of fairness and accountability. △ Less

Submitted 6 April, 2020; v1 submitted 17 September, 2019; originally announced September 2019.

Comments: Accepted to ACL 2020 as a long paper. Updated version

arXiv:1909.05356 [pdf, other]

Entity Projection via Machine Translation for Cross-Lingual NER

Authors: Alankar Jain, Bhargavi Paranjape, Zachary C. Lipton

Abstract: Although over 100 languages are supported by strong off-the-shelf machine translation systems, only a subset of them possess large annotated corpora for named entity recognition. Motivated by this fact, we leverage machine translation to improve annotation-projection approaches to cross-lingual named entity recognition. We propose a system that improves over prior entity-projection methods by: (a)… ▽ More Although over 100 languages are supported by strong off-the-shelf machine translation systems, only a subset of them possess large annotated corpora for named entity recognition. Motivated by this fact, we leverage machine translation to improve annotation-projection approaches to cross-lingual named entity recognition. We propose a system that improves over prior entity-projection methods by: (a) leveraging machine translation systems twice: first for translating sentences and subsequently for translating entities; (b) matching entities based on orthographic and phonetic similarity; and (c) identifying matches based on distributional statistics derived from the dataset. Our approach improves upon current state-of-the-art methods for cross-lingual named entity recognition on 5 diverse languages by an average of 4.1 points. Further, our method achieves state-of-the-art F_1 scores for Armenian, outperforming even a monolingual model trained on Armenian source data. △ Less

Submitted 13 September, 2019; v1 submitted 31 August, 2019; originally announced September 2019.

arXiv:1908.04364 [pdf, other]

AmazonQA: A Review-Based Question Answering Task

Authors: Mansi Gupta, Nitish Kulkarni, Raghuveer Chanda, Anirudha Rayasam, Zachary C Lipton

Abstract: Every day, thousands of customers post questions on Amazon product pages. After some time, if they are fortunate, a knowledgeable customer might answer their question. Observing that many questions can be answered based upon the available product reviews, we propose the task of review-based QA. Given a corpus of reviews and a question, the QA system synthesizes an answer. To this end, we introduce… ▽ More Every day, thousands of customers post questions on Amazon product pages. After some time, if they are fortunate, a knowledgeable customer might answer their question. Observing that many questions can be answered based upon the available product reviews, we propose the task of review-based QA. Given a corpus of reviews and a question, the QA system synthesizes an answer. To this end, we introduce a new dataset and propose a method that combines information retrieval techniques for selecting relevant reviews (given a question) and "reading comprehension" models for synthesizing an answer (given a question and review). Our dataset consists of 923k questions, 3.6M answers and 14M reviews across 156k products. Building on the well-known Amazon dataset, we collect additional annotations, marking each question as either answerable or unanswerable based on the available reviews. A deployed system could first classify a question as answerable and then attempt to generate an answer. Notably, unlike many popular QA datasets, here, the questions, passages, and answers are all extracted from real human interactions. We evaluate numerous models for answer generation and propose strong baselines, demonstrating the challenging nature of this new task. △ Less

Submitted 20 August, 2019; v1 submitted 12 August, 2019; originally announced August 2019.

Comments: 8 pages, 7 figures; IJCAI-19; first three authors contribute equally. Data and code available at https://github.com/amazonqa/amazonqa

Showing 51–100 of 148 results for author: Lipton, Z