-
WILDS: A Benchmark of in-the-Wild Distribution Shifts
Authors:
Pang Wei Koh,
Shiori Sagawa,
Henrik Marklund,
Sang Michael Xie,
Marvin Zhang,
Akshay Balsubramani,
Weihua Hu,
Michihiro Yasunaga,
Richard Lanas Phillips,
Irena Gao,
Tony Lee,
Etienne David,
Ian Stavness,
Wei Guo,
Berton A. Earnshaw,
Imran S. Haque,
Sara Beery,
Jure Leskovec,
Anshul Kundaje,
Emma Pierson,
Sergey Levine,
Chelsea Finn,
Percy Liang
Abstract:
Distribution shifts -- where the training distribution differs from the test distribution -- can substantially degrade the accuracy of machine learning (ML) systems deployed in the wild. Despite their ubiquity in real-world deployments, these distribution shifts are under-represented in the datasets widely used in the ML community today. To address this gap, we present WILDS, a curated benchmark of 10 datasets reflecting a diverse range of distribution shifts that naturally arise in real-world applications, such as shifts across hospitals for tumor identification; across camera traps for wildlife monitoring; and across time and location in satellite imaging and poverty mapping. On each dataset, we show that standard training yields substantially lower out-of-distribution than in-distribution performance. This gap remains even with models trained by existing methods for tackling distribution shifts, underscoring the need for new methods for training models that are more robust to the types of distribution shifts that arise in practice. To facilitate method development, we provide an open-source package that automates dataset loading, contains default model architectures and hyperparameters, and standardizes evaluations. Code and leaderboards are available at https://wilds.stanford.edu.
Submitted 16 July, 2021; v1 submitted 14 December, 2020;
originally announced December 2020.
-
p-value peeking and estimating extrema
Authors:
Akshay Balsubramani
Abstract:
A pervasive issue in statistical hypothesis testing is that the reported $p$-values are biased downward by data "peeking" -- the practice of reporting only progressively extreme values of the test statistic as more data samples are collected. We develop principled mechanisms to estimate such running extrema of test statistics, which directly address the effect of peeking in some general scenarios.
Submitted 2 November, 2020;
originally announced November 2020.
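The downward bias from peeking described in this abstract can be seen in a small simulation. The sketch below is only an illustration under assumed settings (a two-sided z-test on unit-variance Gaussian data, a 1.96 cutoff, and the hypothetical function name `false_positive_rates`); it does not implement the estimators developed in the paper.

```python
import math
import random

def false_positive_rates(n_trials=1000, n_samples=200, z_crit=1.96, seed=0):
    """Compare rejection rates under a true null, with and without peeking.

    'Peeking' here means rejecting if the z-statistic ever crosses z_crit
    at any sample size up to n_samples; the fixed-sample test looks only
    at the final z-statistic.
    """
    rng = random.Random(seed)
    peek_rejects = 0
    final_rejects = 0
    for _ in range(n_trials):
        total = 0.0
        crossed = False
        z = 0.0
        for n in range(1, n_samples + 1):
            total += rng.gauss(0.0, 1.0)  # null is true: mean 0, variance 1
            z = total / math.sqrt(n)
            if abs(z) > z_crit:
                crossed = True            # most extreme value seen so far
        peek_rejects += crossed
        final_rejects += abs(z) > z_crit
    return peek_rejects / n_trials, final_rejects / n_trials

peek_rate, final_rate = false_positive_rates()
print(f"with peeking: {peek_rate:.3f}, fixed sample size: {final_rate:.3f}")
```

Reporting the running extremum of the statistic rejects far more often than the nominal level, which is the effect a peeking correction has to account for.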
-
Sharp finite-sample concentration of independent variables
Authors:
Akshay Balsubramani
Abstract:
We show an extension of Sanov's theorem on large deviations, controlling the tail probabilities of i.i.d. random variables with matching concentration and anti-concentration bounds. This result has a general scope, applies to samples of any size, and has a short information-theoretic proof using elementary techniques.
Submitted 8 October, 2021; v1 submitted 30 August, 2020;
originally announced August 2020.
-
Learning transport cost from subset correspondence
Authors:
Ruishan Liu,
Akshay Balsubramani,
James Zou
Abstract:
Learning to align multiple datasets is an important problem with many applications, and it is especially useful when we need to integrate multiple experiments or correct for confounding. Optimal transport (OT) is a principled approach to align datasets, but a key challenge in applying OT is that we need to specify a transport cost function that accurately captures how the two datasets are related. Reliable cost functions are typically not available and practitioners often resort to using hand-crafted or Euclidean cost even if it may not be appropriate. In this work, we investigate how to learn the cost function using a small amount of side information which is often available. The side information we consider captures subset correspondence -- i.e. certain subsets of points in the two datasets are known to be related. For example, we may have some images labeled as cars in both datasets; or we may have a common annotated cell type in single-cell data from two batches. We develop an end-to-end optimizer (OT-SI) that differentiates through the Sinkhorn algorithm and effectively learns the suitable cost function from side information. On systematic experiments in images, marriage-matching and single-cell RNA-seq, our method substantially outperforms state-of-the-art benchmarks.
Submitted 30 July, 2021; v1 submitted 29 September, 2019;
originally announced September 2019.
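For readers unfamiliar with the Sinkhorn algorithm mentioned in this abstract, the following is a minimal sketch of the standard entropy-regularized iteration for a fixed cost matrix. It is not OT-SI: it learns nothing and does no differentiation. The function name `sinkhorn`, the regularization value, and the iteration count are assumptions chosen for illustration.

```python
import math

def sinkhorn(cost, a, b, reg=0.1, n_iters=500):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    cost : n x m list of lists of transport costs
    a, b : source and target marginals (each sums to 1)
    Returns an n x m transport plan whose row sums approach a and whose
    column sums approach b. Very small reg values can underflow exp().
    """
    n, m = len(a), len(b)
    # Gibbs kernel: low-cost pairs get weight close to 1.
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Two points per side; matching index i to index i is free, crossing costs 1.
plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]], [0.5, 0.5], [0.5, 0.5])
print(plan)
```

Every step of the loop is a smooth function of the cost matrix, which is the property that makes it possible in principle to differentiate through the iteration with respect to the cost.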
-
An adaptive nearest neighbor rule for classification
Authors:
Akshay Balsubramani,
Sanjoy Dasgupta,
Yoav Freund,
Shay Moran
Abstract:
We introduce a variant of the $k$-nearest neighbor classifier in which $k$ is chosen adaptively for each query, rather than supplied as a parameter. The choice of $k$ depends on properties of each neighborhood, and therefore may significantly vary between different points. (For example, the algorithm will use larger $k$ for predicting the labels of points in noisy regions.)
We provide theory and experiments that demonstrate that the algorithm performs comparably to, and sometimes better than, $k$-NN with an optimal choice of $k$. In particular, we derive bounds on the convergence rates of our classifier that depend on a local quantity we call the 'advantage', which is significantly weaker than the Lipschitz conditions used in previous convergence rate proofs. These generalization bounds hinge on a variant of the seminal Uniform Convergence Theorem due to Vapnik and Chervonenkis; this variant concerns conditional probabilities and may be of independent interest.
Submitted 29 May, 2019;
originally announced May 2019.
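One way to picture a per-query choice of $k$ is the sketch below: grow the neighborhood until the label average is confidently away from zero. This is an illustrative rule only. The margin form `c * sqrt((log n + log(1/delta)) / k)`, the constants `c` and `delta`, the abstain-on-failure behavior, and the name `adaptive_knn_predict` are all assumptions, not details taken from the abstract.

```python
import math

def adaptive_knn_predict(train_x, train_y, query, c=1.0, delta=0.05):
    """Predict a +1/-1 label, choosing k separately for each query.

    Scans neighbors from nearest to farthest and stops at the first k whose
    average label exceeds an assumed confidence margin in absolute value.
    Returns (label, k); label is 0 (abstain) if no k is confident.
    """
    n = len(train_x)
    order = sorted(range(n), key=lambda i: math.dist(train_x[i], query))
    running_sum = 0
    for k, idx in enumerate(order, start=1):
        running_sum += train_y[idx]
        # Assumed margin: shrinks as the neighborhood grows.
        margin = c * math.sqrt((math.log(n) + math.log(1.0 / delta)) / k)
        if abs(running_sum) / k > margin:
            return (1 if running_sum > 0 else -1), k
    return 0, n

# 1-D toy data: negative points labeled -1, positive points labeled +1.
xs = [(float(i),) for i in range(-20, 0)] + [(float(i),) for i in range(1, 21)]
ys = [-1] * 20 + [1] * 20
print(adaptive_knn_predict(xs, ys, (10.0,)))  # deep inside the +1 region
print(adaptive_knn_predict(xs, ys, (0.5,)))   # near the class boundary
```

Queries in clean regions stop at a small $k$, while queries where the neighboring labels disagree keep expanding, matching the abstract's remark that noisy regions call for larger $k$.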
-
Linking Generative Adversarial Learning and Binary Classification
Authors:
Akshay Balsubramani
Abstract:
In this note, we point out a basic link between generative adversarial (GA) training and binary classification -- any powerful discriminator essentially computes an (f-)divergence between real and generated samples. The result, repeatedly re-derived in decision theory, has implications for GA Networks (GANs), providing an alternative perspective on training f-GANs by designing the discriminator loss function.
Submitted 5 September, 2017;
originally announced September 2017.
-
Semantically Decomposing the Latent Spaces of Generative Adversarial Networks
Authors:
Chris Donahue,
Zachary C. Lipton,
Akshay Balsubramani,
Julian McAuley
Abstract:
We propose a new algorithm for training generative adversarial networks that jointly learns latent codes for both identities (e.g. individual humans) and observations (e.g. specific photographs). By fixing the identity portion of the latent codes, we can generate diverse images of the same subject, and by fixing the observation portion, we can traverse the manifold of subjects while maintaining contingent aspects such as lighting and pose. Our algorithm features a pairwise training scheme in which each sample from the generator consists of two images with a common identity code. Corresponding samples from the real dataset consist of two distinct photographs of the same subject. In order to fool the discriminator, the generator must produce pairs that are photorealistic, distinct, and appear to depict the same individual. We augment both the DCGAN and BEGAN approaches with Siamese discriminators to facilitate pairwise training. Experiments with human judges and an off-the-shelf face verification system demonstrate our algorithm's ability to generate convincing, identity-matched photographs.
Submitted 22 February, 2018; v1 submitted 22 May, 2017;
originally announced May 2017.
-
Optimal Binary Autoencoding with Pairwise Correlations
Authors:
Akshay Balsubramani
Abstract:
We formulate learning of a binary autoencoder as a biconvex optimization problem which learns from the pairwise correlations between encoded and decoded bits. Among all possible algorithms that use this information, ours finds the autoencoder that reconstructs its inputs with worst-case optimal loss. The optimal decoder is a single layer of artificial neurons, emerging entirely from the minimax loss minimization, and with weights learned by convex optimization. All this is reflected in competitive experimental results, demonstrating that binary autoencoding can be done efficiently by conveying information in pairwise correlations in an optimal fashion.
Submitted 7 November, 2016;
originally announced November 2016.
-
Muffled Semi-Supervised Learning
Authors:
Akshay Balsubramani,
Yoav Freund
Abstract:
We explore a novel approach to semi-supervised learning. This approach is contrary to the common approach in that the unlabeled examples serve to "muffle," rather than enhance, the guidance provided by the labeled examples. We provide several variants of the basic algorithm and show experimentally that they can achieve significantly higher AUC than boosted trees, random forests and logistic regression when unlabeled examples are available.
Submitted 27 May, 2016;
originally announced May 2016.
-
Learning to Abstain from Binary Prediction
Authors:
Akshay Balsubramani
Abstract:
A binary classifier capable of abstaining from making a label prediction has two goals in tension: minimizing errors, and avoiding abstaining unnecessarily often. In this work, we exactly characterize the best achievable tradeoff between these two goals in a general semi-supervised setting, given an ensemble of predictors of varying competence as well as unlabeled data on which we wish to predict or abstain. We give an algorithm for learning a classifier in this setting which trades off its errors with abstentions in a minimax optimal manner, is as efficient as linear learning and prediction, and is demonstrably practical. Our analysis extends to a large class of loss functions and other scenarios, including ensembles composed of specialists that can themselves abstain.
Submitted 29 November, 2016; v1 submitted 25 February, 2016;
originally announced February 2016.
-
The Utility of Abstaining in Binary Classification
Authors:
Akshay Balsubramani
Abstract:
We explore the problem of binary classification in machine learning, with a twist - the classifier is allowed to abstain on any datum, professing ignorance about the true class label without committing to any prediction. This is directly motivated by applications like medical diagnosis and fraud risk assessment, in which incorrect predictions have potentially calamitous consequences. We focus on a recent spate of theoretically driven work in this area that characterizes how allowing abstentions can lead to fewer errors in very general settings. Two areas are highlighted: the surprising possibility of zero-error learning, and the fundamental tradeoff between predicting sufficiently often and avoiding incorrect predictions. We review efficient algorithms with provable guarantees for each of these areas. We also discuss connections to other scenarios, notably active learning, as they suggest promising directions of further inquiry in this emerging field.
Submitted 26 December, 2015;
originally announced December 2015.
-
Optimal Binary Classifier Aggregation for General Losses
Authors:
Akshay Balsubramani,
Yoav Freund
Abstract:
We address the problem of aggregating an ensemble of predictors with known loss bounds in a semi-supervised binary classification setting, to minimize prediction loss incurred on the unlabeled data. We find the minimax optimal predictions for a very general class of loss functions including all convex and many non-convex losses, extending a recent analysis of the problem for misclassification error. The result is a family of semi-supervised ensemble aggregation algorithms which are as efficient as linear learning by convex optimization, but are minimax optimal without any relaxations. Their decision rules take a form familiar in decision theory -- applying sigmoid functions to a notion of ensemble margin -- without the assumptions typically made in margin-based learning.
Submitted 7 November, 2016; v1 submitted 1 October, 2015;
originally announced October 2015.
-
PAC-Bayes Iterated Logarithm Bounds for Martingale Mixtures
Authors:
Akshay Balsubramani
Abstract:
We give tight concentration bounds for mixtures of martingales that are simultaneously uniform over (a) mixture distributions, in a PAC-Bayes sense; and (b) all finite times. These bounds are proved in terms of the martingale variance, extending classical Bernstein inequalities, and sharpening and simplifying prior work.
Submitted 22 June, 2015;
originally announced June 2015.
-
Scalable Semi-Supervised Aggregation of Classifiers
Authors:
Akshay Balsubramani,
Yoav Freund
Abstract:
We present and empirically evaluate an efficient algorithm that learns to aggregate the predictions of an ensemble of binary classifiers. The algorithm uses the structure of the ensemble predictions on unlabeled data to yield significant performance improvements. It does this without making assumptions on the structure or origin of the ensemble, without parameters, and as scalably as linear learning. We empirically demonstrate these performance gains with random forests.
Submitted 10 November, 2015; v1 submitted 18 June, 2015;
originally announced June 2015.
-
Sequential Nonparametric Testing with the Law of the Iterated Logarithm
Authors:
Akshay Balsubramani,
Aaditya Ramdas
Abstract:
We propose a new algorithmic framework for sequential hypothesis testing with i.i.d. data, which includes A/B testing, nonparametric two-sample testing, and independence testing as special cases. It is novel in several ways: (a) it takes linear time and constant space to compute on the fly, (b) it has the same power guarantee as a non-sequential version of the test with the same computational constraints up to a small factor, and (c) it accesses only as many samples as are required - its stopping time adapts to the unknown difficulty of the problem. All our test statistics are constructed to be zero-mean martingales under the null hypothesis, and the rejection threshold is governed by a uniform non-asymptotic law of the iterated logarithm (LIL). For the case of nonparametric two-sample mean testing, we also provide a finite sample power analysis, and the first non-asymptotic stopping time calculations for this class of problems. We verify our predictions for type I and II errors and stopping times using simulations.
Submitted 1 March, 2016; v1 submitted 10 June, 2015;
originally announced June 2015.
-
Optimally Combining Classifiers Using Unlabeled Data
Authors:
Akshay Balsubramani,
Yoav Freund
Abstract:
We develop a worst-case analysis of aggregation of classifier ensembles for binary classification. The task of predicting to minimize error is formulated as a game played over a given set of unlabeled data (a transductive setting), where prior label information is encoded as constraints on the game. The minimax solution of this game identifies cases where a weighted combination of the classifiers can perform significantly better than any single classifier.
Submitted 18 June, 2015; v1 submitted 5 March, 2015;
originally announced March 2015.
-
PAC-Bayes with Minimax for Confidence-Rated Transduction
Authors:
Akshay Balsubramani,
Yoav Freund
Abstract:
We consider using an ensemble of binary classifiers for transductive prediction, when unlabeled test data are known in advance. We derive minimax optimal rules for confidence-rated prediction in this setting. By using PAC-Bayes analysis on these rules, we obtain data-dependent performance guarantees without distributional assumptions on the data. Our analysis techniques are readily extended to a setting in which the predictor is allowed to abstain.
Submitted 15 January, 2015;
originally announced January 2015.
-
The Fast Convergence of Incremental PCA
Authors:
Akshay Balsubramani,
Sanjoy Dasgupta,
Yoav Freund
Abstract:
We consider a situation in which we see samples in $\mathbb{R}^d$ drawn i.i.d. from some distribution with mean zero and unknown covariance A. We wish to compute the top eigenvector of A in an incremental fashion - with an algorithm that maintains an estimate of the top eigenvector in O(d) space, and incrementally adjusts the estimate with each new data point that arrives. Two classical such schemes are due to Krasulina (1969) and Oja (1983). We give finite-sample convergence rates for both.
Submitted 15 January, 2015;
originally announced January 2015.
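The incremental setting in this abstract, an O(d)-space estimate adjusted with each arriving point, can be sketched with the commonly stated form of Oja's update, $v \leftarrow v + \eta_t\, x\,(x^\top v)$ followed by renormalization. The step size $\eta_t = 1/t$, the random initialization, and the function names are assumptions for illustration; the paper's analyzed step sizes and rates are not reproduced here.

```python
import math
import random

def oja_top_eigenvector(samples, dim, seed=0):
    """Estimate the top eigenvector of the covariance of zero-mean samples.

    Stores a single d-dimensional vector (O(d) memory) and updates it once
    per sample with Oja's rule, then renormalizes to unit length.
    """
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]
    for t, x in enumerate(samples, start=1):
        eta = 1.0 / t                                # assumed step-size schedule
        proj = sum(xi * vi for xi, vi in zip(x, v))  # x . v
        v = [vi + eta * proj * xi for vi, xi in zip(v, x)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
    return v

def make_samples(n, seed=1):
    """Zero-mean 2-D Gaussian samples with most variance on the first axis."""
    rng = random.Random(seed)
    return [(rng.gauss(0.0, 3.0), rng.gauss(0.0, 0.5)) for _ in range(n)]

v = oja_top_eigenvector(make_samples(5000), dim=2)
print(v)
```

Because the sign of an eigenvector is arbitrary, the estimate should be compared with the true direction through the absolute value of their inner product.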
-
Sharp Finite-Time Iterated-Logarithm Martingale Concentration
Authors:
Akshay Balsubramani
Abstract:
We give concentration bounds for martingales that are uniform over finite times and extend classical Hoeffding and Bernstein inequalities. We also demonstrate our concentration bounds to be optimal with a matching anti-concentration inequality, proved using the same method. Together these constitute a finite-time version of the law of the iterated logarithm, and shed light on the relationship between it and the central limit theorem.
Submitted 1 December, 2015; v1 submitted 12 May, 2014;
originally announced May 2014.