Implicit Rate-Constrained Optimization of Non-decomposable Objectives

Authors: Abhishek Kumar, Harikrishna Narasimhan, Andrew Cotter

Abstract: We consider a popular family of constrained optimization problems arising in machine learning that involve optimizing a non-decomposable evaluation metric with a certain thresholded form, while constraining another metric of interest. Examples of such problems include optimizing the false negative rate at a fixed false positive rate, optimizing precision at a fixed recall, optimizing the area under the precision-recall or ROC curves, etc. Our key idea is to formulate a rate-constrained optimization that expresses the threshold parameter as a function of the model parameters via the Implicit Function theorem. We show how the resulting optimization problem can be solved using standard gradient-based methods. Experiments on benchmark datasets demonstrate the effectiveness of our proposed method over existing state-of-the-art approaches for these problems. The code for the proposed method is available at https://github.com/google-research/google-research/tree/master/implicit_constrained_optimization.

Submitted 28 July, 2021; v1 submitted 22 July, 2021; originally announced July 2021.

Comments: ICML 2021; Code available at https://github.com/google-research/google-research/tree/master/implicit_constrained_optimization
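
As a rough illustration of the implicit-function idea described in the abstract above, the following sketch uses notation of our own choosing, which may differ from the paper's. Assume the constraint can be written as g(θ, τ) = c, where θ stands for the model parameters, τ for the threshold, c for the target rate, and g for a smooth surrogate of the constrained rate with ∂g/∂τ ≠ 0. The Implicit Function theorem then gives the gradient of the threshold, and the chain rule gives the gradient of an objective f that depends on both θ and τ(θ):

```latex
g(\theta, \tau(\theta)) = c
\quad\Longrightarrow\quad
\nabla_\theta \tau(\theta)
  = -\left(\frac{\partial g}{\partial \tau}\right)^{-1}
    \nabla_\theta g(\theta, \tau)\Big|_{\tau = \tau(\theta)},
\qquad
\nabla_\theta \big[ f(\theta, \tau(\theta)) \big]
  = \nabla_\theta f + \frac{\partial f}{\partial \tau}\, \nabla_\theta \tau(\theta).
```

Under these assumptions the constrained problem reduces to an unconstrained one in θ alone, which is what makes standard gradient-based methods applicable.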

arXiv:2106.02654 [pdf, other]

Churn Reduction via Distillation

Authors: Heinrich Jiang, Harikrishna Narasimhan, Dara Bahri, Andrew Cotter, Afshin Rostamizadeh

Abstract: In real-world systems, models are frequently updated as more data becomes available, and in addition to achieving high accuracy, the goal is to also maintain a low difference in predictions compared to the base model (i.e. predictive "churn"). If model retraining results in vastly different behavior, then it could cause negative effects in downstream systems, especially if this churn can be avoided with limited impact on model accuracy. In this paper, we show an equivalence between training with distillation using the base model as the teacher and training with an explicit constraint on the predictive churn. We then show that distillation performs strongly for low churn training against a number of recent baselines on a wide range of datasets and model architectures, including fully-connected networks, convolutional networks, and transformers.

Submitted 14 March, 2022; v1 submitted 4 June, 2021; originally announced June 2021.

Journal ref: ICLR 2022

arXiv:2102.06849 [pdf, other]

Distilling Double Descent

Authors: Andrew Cotter, Aditya Krishna Menon, Harikrishna Narasimhan, Ankit Singh Rawat, Sashank J. Reddi, Yichen Zhou

Abstract: Distillation is the technique of training a "student" model based on examples that are labeled by a separate "teacher" model, which itself is trained on a labeled dataset. The most common explanations for why distillation "works" are predicated on the assumption that the student is provided with soft labels, e.g. probabilities or confidences, from the teacher model. In this work, we show that, even when the teacher model is highly overparameterized and provides hard labels, using a very large held-out unlabeled dataset to train the student model can result in a model that outperforms more "traditional" approaches. Our explanation for this phenomenon is based on recent work on "double descent". It has been observed that, once a model's complexity roughly exceeds the amount required to memorize the training data, increasing the complexity further can, counterintuitively, result in better generalization. Researchers have identified several settings in which it takes place, while others have made various attempts to explain it (thus far, with only partial success). In contrast, we avoid these questions, and instead seek to exploit this phenomenon by demonstrating that a highly-overparameterized teacher can avoid overfitting via double descent, while a student trained on a larger independent dataset labeled by this teacher will avoid overfitting due to the size of its training set.

Submitted 12 February, 2021; originally announced February 2021.

arXiv:2002.09343 [pdf, ps, other]

Robust Optimization for Fairness with Noisy Protected Groups

Authors: Serena Wang, Wenshuo Guo, Harikrishna Narasimhan, Andrew Cotter, Maya Gupta, Michael I. Jordan

Abstract: Many existing fairness criteria for machine learning involve equalizing some metric across protected groups such as race or gender. However, practitioners trying to audit or enforce such group-based criteria can easily face the problem of noisy or biased protected group information. First, we study the consequences of naively relying on noisy protected group labels: we provide an upper bound on the fairness violations on the true groups G when the fairness criteria are satisfied on noisy groups $\hat{G}$. Second, we introduce two new approaches using robust optimization that, unlike the naive approach of only relying on $\hat{G}$, are guaranteed to satisfy fairness criteria on the true protected groups G while minimizing a training objective. We provide theoretical guarantees that one such approach converges to an optimal feasible solution. Using two case studies, we show empirically that the robust approaches achieve better true group fairness guarantees than the naive approach.

Submitted 10 November, 2020; v1 submitted 21 February, 2020; originally announced February 2020.

Comments: To appear at 34th Conference on Neural Information Processing Systems (NeurIPS 2020); first two authors contributed equally to this work

arXiv:1909.02939 [pdf, other]

Optimizing Generalized Rate Metrics through Game Equilibrium

Authors: Harikrishna Narasimhan, Andrew Cotter, Maya Gupta

Abstract: We present a general framework for solving a large class of learning problems with non-linear functions of classification rates. This includes problems where one wishes to optimize a non-decomposable performance metric such as the F-measure or G-mean, and constrained training problems where the classifier needs to satisfy non-linear rate constraints such as predictive parity fairness, distribution divergences or churn ratios. We extend previous two-player game approaches for constrained optimization to a game between three players to decouple the classifier rates from the non-linear objective, and seek to find an equilibrium of the game. Our approach generalizes many existing algorithms, and makes possible new algorithms with more flexibility and tighter handling of non-linear rate constraints. We provide convergence guarantees for convex functions of rates, and show how our methodology can be extended to handle sums of ratios of rates. Experiments on different fairness tasks confirm the efficacy of our approach.

Submitted 6 September, 2019; originally announced September 2019.

arXiv:1906.05330 [pdf, other]

Pairwise Fairness for Ranking and Regression

Authors: Harikrishna Narasimhan, Andrew Cotter, Maya Gupta, Serena Wang

Abstract: We present pairwise fairness metrics for ranking models and regression models that form analogues of statistical fairness notions such as equal opportunity, equal accuracy, and statistical parity. Our pairwise formulation supports both discrete protected groups, and continuous protected attributes. We show that the resulting training problems can be efficiently and effectively solved using existing constrained optimization and robust optimization techniques developed for fair classification. Experiments illustrate the broad applicability and trade-offs of these methods.

Submitted 7 January, 2020; v1 submitted 12 June, 2019; originally announced June 2019.

arXiv:1809.04198 [pdf, other]

Optimization with Non-Differentiable Constraints with Applications to Fairness, Recall, Churn, and Other Goals

Authors: Andrew Cotter, Heinrich Jiang, Serena Wang, Taman Narayan, Maya Gupta, Seungil You, Karthik Sridharan

Abstract: We show that many machine learning goals, such as improved fairness metrics, can be expressed as constraints on the model's predictions, which we call rate constraints. We study the problem of training non-convex models subject to these rate constraints (or any non-convex and non-differentiable constraints). In the non-convex setting, the standard approach of Lagrange multipliers may fail. Furthermore, if the constraints are non-differentiable, then one cannot optimize the Lagrangian with gradient-based methods. To solve these issues, we introduce the proxy-Lagrangian formulation. This new formulation leads to an algorithm that produces a stochastic classifier by playing a two-player non-zero-sum game solving for what we call a semi-coarse correlated equilibrium, which in turn corresponds to an approximately optimal and feasible solution to the constrained optimization problem. We then give a procedure which shrinks the randomized solution down to one that is a mixture of at most $m+1$ deterministic solutions, given $m$ constraints. This culminates in algorithms that can solve non-convex constrained optimization problems with possibly non-differentiable and non-convex constraints with theoretical guarantees. We provide extensive experimental results enforcing a wide range of policy goals including different fairness metrics, and other goals on accuracy, coverage, recall, and churn.

Submitted 11 September, 2018; originally announced September 2018.

arXiv:1807.00028 [pdf, other]

Training Well-Generalizing Classifiers for Fairness Metrics and Other Data-Dependent Constraints

Authors: Andrew Cotter, Maya Gupta, Heinrich Jiang, Nathan Srebro, Karthik Sridharan, Serena Wang, Blake Woodworth, Seungil You

Abstract: Classifiers can be trained with data-dependent constraints to satisfy fairness goals, reduce churn, achieve a targeted false positive rate, or other policy goals. We study the generalization performance for such constrained optimization problems, in terms of how well the constraints are satisfied at evaluation time, given that they are satisfied at training time. To improve generalization performance, we frame the problem as a two-player game where one player optimizes the model parameters on a training dataset, and the other player enforces the constraints on an independent validation dataset. We build on recent work in two-player constrained optimization to show that if one uses this two-dataset approach, then constraint generalization can be significantly improved. As we illustrate experimentally, this approach works not only in theory, but also in practice.

Submitted 28 September, 2018; v1 submitted 29 June, 2018; originally announced July 2018.

arXiv:1806.11212 [pdf, other]

Proxy Fairness

Authors: Maya Gupta, Andrew Cotter, Mahdi Milani Fard, Serena Wang

Abstract: We consider the problem of improving fairness when one lacks access to a dataset labeled with protected groups, making it difficult to take advantage of strategies that can improve fairness but require protected group labels, either at training or runtime. To address this, we investigate improving fairness metrics for proxy groups, and test whether doing so results in improved fairness for the true sensitive groups. Results on benchmark and real-world datasets demonstrate that such a proxy fairness strategy can work well in practice. However, we caution that the effectiveness likely depends on the choice of fairness metric, as well as how aligned the proxy groups are with the true protected groups in terms of the constrained model parameters.

Submitted 28 June, 2018; originally announced June 2018.

arXiv:1806.00050 [pdf, other]

Interpretable Set Functions

Authors: Andrew Cotter, Maya Gupta, Heinrich Jiang, James Muller, Taman Narayan, Serena Wang, Tao Zhu

Abstract: We propose learning flexible but interpretable functions that aggregate a variable-length set of permutation-invariant feature vectors to predict a label. We use a deep lattice network model so we can architect the model structure to enhance interpretability, and add monotonicity constraints between inputs-and-outputs. We then use the proposed set function to automate the engineering of dense, interpretable features from sparse categorical features, which we call semantic feature engine. Experiments on real-world data show the achieved accuracy is similar to deep sets or deep neural networks, and is easier to debug and understand.

Submitted 31 May, 2018; originally announced June 2018.

arXiv:1804.06500 [pdf, ps, other]

Two-Player Games for Efficient Non-Convex Constrained Optimization

Authors: Andrew Cotter, Heinrich Jiang, Karthik Sridharan

Abstract: In recent years, constrained optimization has become increasingly relevant to the machine learning community, with applications including Neyman-Pearson classification, robust optimization, and fair machine learning. A natural approach to constrained optimization is to optimize the Lagrangian, but this is not guaranteed to work in the non-convex setting, and, if using a first-order method, cannot cope with non-differentiable constraints (e.g. constraints on rates or proportions). The Lagrangian can be interpreted as a two-player game played between a player who seeks to optimize over the model parameters, and a player who wishes to maximize over the Lagrange multipliers. We propose a non-zero-sum variant of the Lagrangian formulation that can cope with non-differentiable--even discontinuous--constraints, which we call the "proxy-Lagrangian". The first player minimizes external regret in terms of easy-to-optimize "proxy constraints", while the second player enforces the original constraints by minimizing swap regret. For this new formulation, as for the Lagrangian in the non-convex setting, the result is a stochastic classifier. For both the proxy-Lagrangian and Lagrangian formulations, however, we prove that this classifier, instead of having unbounded size, can be taken to be a distribution over no more than m+1 models (where m is the number of constraints). This is a significant improvement in practical terms.

Submitted 28 September, 2018; v1 submitted 17 April, 2018; originally announced April 2018.
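
The two-player reading of the Lagrangian mentioned in the abstract above can be sketched on a toy problem. This is a minimal illustration of the plain (zero-sum) Lagrangian game with simultaneous gradient descent-ascent, not the proxy-Lagrangian algorithm from the paper. The toy problem (minimize x^2 subject to x >= 1), the function name `lagrangian_gda`, the step size, and the iteration count are all assumptions made for this example.

```python
def lagrangian_gda(steps=2000, lr=0.05):
    """Gradient descent-ascent on L(x, lam) = x**2 + lam * (1 - x).

    Toy problem (hypothetical): minimize x**2 subject to 1 - x <= 0.
    The x-player descends on L; the lam-player ascends on L and is
    projected back onto lam >= 0 after every step.
    """
    x, lam = 0.0, 0.0
    for _ in range(steps):
        grad_x = 2.0 * x - lam      # dL/dx
        grad_lam = 1.0 - x          # dL/dlam, i.e. the constraint violation
        x, lam = x - lr * grad_x, max(0.0, lam + lr * grad_lam)
    return x, lam


if __name__ == "__main__":
    x, lam = lagrangian_gda()
    print(f"x = {x:.4f}, lambda = {lam:.4f}")
```

On this convex toy problem the iterates approach the constrained optimum x = 1 with multiplier 2. The abstract's point is that such guarantees do not carry over to non-convex models or non-differentiable constraints, which motivates the proxy-Lagrangian variant.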

arXiv:1606.07558 [pdf, ps, other]

Satisfying Real-world Goals with Dataset Constraints

Authors: Gabriel Goh, Andrew Cotter, Maya Gupta, Michael Friedlander

Abstract: The goal of minimizing misclassification error on a training set is often just one of several real-world goals that might be defined on different datasets. For example, one may require a classifier to also make positive predictions at some specified rate for some subpopulation (fairness), or to achieve a specified empirical recall. Other real-world goals include reducing churn with respect to a previously deployed model, or stabilizing online training. In this paper we propose handling multiple goals on multiple datasets by training with dataset constraints, using the ramp penalty to accurately quantify costs, and present an efficient algorithm to approximately optimize the resulting non-convex constrained optimization problem. Experiments on both benchmark and real-world industry datasets demonstrate the effectiveness of our approach.

Submitted 3 May, 2017; v1 submitted 23 June, 2016; originally announced June 2016.

arXiv:1512.04960 [pdf, ps, other]

A Light Touch for Heavily Constrained SGD

Authors: Andrew Cotter, Maya Gupta, Jan Pfeifer

Abstract: Minimizing empirical risk subject to a set of constraints can be a useful strategy for learning restricted classes of functions, such as monotonic functions, submodular functions, classifiers that guarantee a certain class label for some subset of examples, etc. However, these restrictions may result in a very large number of constraints. Projected stochastic gradient descent (SGD) is often the default choice for large-scale optimization in machine learning, but requires a projection after each update. For heavily-constrained objectives, we propose an efficient extension of SGD that stays close to the feasible region while only applying constraints probabilistically at each iteration. Theoretical analysis shows a compelling trade-off between per-iteration work and the number of iterations needed on problems with a large number of constraints.

Submitted 24 October, 2016; v1 submitted 15 December, 2015; originally announced December 2015.

Journal ref: 29th Annual Conference on Learning Theory, pp. 729-771, 2016

arXiv:1505.06378 [pdf, other]

Monotonic Calibrated Interpolated Look-Up Tables

Authors: Maya Gupta, Andrew Cotter, Jan Pfeifer, Konstantin Voevodski, Kevin Canini, Alexander Mangylov, Wojtek Moczydlowski, Alex van Esbroeck

Abstract: Real-world machine learning applications may require functions that are fast-to-evaluate and interpretable. In particular, guaranteed monotonicity of the learned function can be critical to user trust. We propose meeting these goals for low-dimensional machine learning problems by learning flexible, monotonic functions using calibrated interpolated look-up tables. We extend the structural risk minimization framework of lattice regression to train monotonic look-up tables by solving a convex problem with appropriate linear inequality constraints. In addition, we propose jointly learning interpretable calibrations of each feature to normalize continuous features and handle categorical or missing data, at the cost of making the objective non-convex. We address large-scale learning through parallelization, mini-batching, and propose random sampling of additive regularizer terms. Case studies with real-world problems with five to sixteen features and thousands to millions of training samples demonstrate the proposed monotonic functions can achieve state-of-the-art accuracy on practical problems while providing greater transparency to users.

Submitted 20 January, 2016; v1 submitted 23 May, 2015; originally announced May 2015.

Comments: To appear (with minor revisions), Journal of Machine Learning Research 2016

arXiv:1308.3509 [pdf, other]

Stochastic Optimization for Machine Learning

Authors: Andrew Cotter

Abstract: It has been found that stochastic algorithms often find good solutions much more rapidly than inherently-batch approaches. Indeed, a very useful rule of thumb is that often, when solving a machine learning problem, an iterative technique which relies on performing a very large number of relatively-inexpensive updates will often outperform one which performs a smaller number of much "smarter" but computationally-expensive updates. In this thesis, we will consider the application of stochastic algorithms to two of the most important machine learning problems. Part i is concerned with the supervised problem of binary classification using kernelized linear classifiers, for which the data have labels belonging to exactly two classes (e.g. "has cancer" or "doesn't have cancer"), and the learning problem is to find a linear classifier which is best at predicting the label. In Part ii, we will consider the unsupervised problem of Principal Component Analysis, for which the learning task is to find the directions which contain most of the variance of the data distribution. Our goal is to present stochastic algorithms for both problems which are, above all, practical--they work well on real-world data, in some cases better than all known competing algorithms. A secondary, but still very important, goal is to derive theoretical bounds on the performance of these algorithms which are at least competitive with, and often better than, those known for other approaches.

Submitted 15 August, 2013; originally announced August 2013.

Comments: PhD Thesis

arXiv:1307.1674 [pdf, other]

Stochastic Optimization of PCA with Capped MSG

Authors: Raman Arora, Andrew Cotter, Nathan Srebro

Abstract: We study PCA as a stochastic optimization problem and propose a novel stochastic approximation algorithm which we refer to as "Matrix Stochastic Gradient" (MSG), as well as a practical variant, Capped MSG. We study the method both theoretically and empirically.

Submitted 5 July, 2013; originally announced July 2013.

arXiv:1204.0566 [pdf, ps, other]

The Kernelized Stochastic Batch Perceptron

Authors: Andrew Cotter, Shai Shalev-Shwartz, Nathan Srebro

Abstract: We present a novel approach for training kernel Support Vector Machines, establish learning runtime guarantees for our method that are better than those of any other known kernelized SVM optimization approach, and show that our method works well in practice compared to existing alternatives.

Submitted 21 June, 2012; v1 submitted 2 April, 2012; originally announced April 2012.

arXiv:1109.4603 [pdf, other]

Explicit Approximations of the Gaussian Kernel

Authors: Andrew Cotter, Joseph Keshet, Nathan Srebro

Abstract: We investigate training and using Gaussian kernel SVMs by approximating the kernel with an explicit finite-dimensional polynomial feature representation based on the Taylor expansion of the exponential. Although not as efficient as the recently-proposed random Fourier features [Rahimi and Recht, 2007] in terms of the number of features, we show how this polynomial representation can provide a better approximation in terms of the computational cost involved. This makes our "Taylor features" especially attractive for use on very large data sets, in conjunction with online or stochastic training.

Submitted 21 September, 2011; originally announced September 2011.

Comments: 11 pages, 2 tables, 2 figures
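
The Taylor-expansion idea in the abstract above can be sketched in a few lines. The Gaussian kernel factors as exp(-|x|^2 / 2σ^2) · exp(-|y|^2 / 2σ^2) · exp(<x, y> / σ^2), and truncating the Taylor series of the last factor gives a finite polynomial feature map. The construction below is our own simplified variant, not necessarily the paper's: it uses full tensor powers (d^k features at degree k) instead of a compact monomial basis, and the names `taylor_features` and `gaussian_kernel` and the default `degree=6` are assumptions for illustration.

```python
import math


def taylor_features(x, sigma=1.0, degree=6):
    """Explicit features whose inner product approximates the Gaussian kernel.

    Degree-k block: all d**k ordered products of coordinates of x, scaled by
    1 / (sigma**k * sqrt(k!)), so that <block_k(x), block_k(y)> equals
    (<x, y> / sigma**2)**k / k!.
    """
    scale = math.exp(-sum(v * v for v in x) / (2.0 * sigma ** 2))
    features = []
    power = [1.0]  # tensor power of order 0
    for k in range(degree + 1):
        coeff = scale / (sigma ** k * math.sqrt(math.factorial(k)))
        features.extend(coeff * p for p in power)
        power = [p * v for p in power for v in x]  # next tensor power
    return features


def gaussian_kernel(x, y, sigma=1.0):
    """Exact Gaussian kernel exp(-|x - y|**2 / (2 * sigma**2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


if __name__ == "__main__":
    x, y = [0.3, -0.2], [0.1, 0.4]
    approx = sum(a * b for a, b in zip(taylor_features(x), taylor_features(y)))
    print(approx, gaussian_kernel(x, y))
```

The feature count grows as the sum of d^k over degrees, which is why this naive version is only practical for small d and low degree; the truncation error shrinks quickly when <x, y> / σ^2 is small.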

arXiv:1106.4574 [pdf, other]

Better Mini-Batch Algorithms via Accelerated Gradient Methods

Authors: Andrew Cotter, Ohad Shamir, Nathan Srebro, Karthik Sridharan

Abstract: Mini-batch algorithms have been proposed as a way to speed up stochastic convex optimization problems. We study how such algorithms can be improved using accelerated gradient methods. We provide a novel analysis, which shows how standard gradient methods may sometimes be insufficient to obtain a significant speed-up and propose a novel accelerated gradient algorithm, which deals with this deficiency, enjoys a uniformly superior guarantee and works well in practice.

Submitted 22 June, 2011; originally announced June 2011.

Showing 1–19 of 19 results for author: Cotter, A