-
Auditing Privacy Mechanisms via Label Inference Attacks
Authors:
Róbert István Busa-Fekete,
Travis Dick,
Claudio Gentile,
Andrés Muñoz Medina,
Adam Smith,
Marika Swanberg
Abstract:
We propose reconstruction advantage measures to audit label privatization mechanisms. A reconstruction advantage measure quantifies the increase in an attacker's ability to infer the true label of an unlabeled example when provided with a private version of the labels in a dataset (e.g., aggregate of labels from different users or noisy labels output by randomized response), compared to an attacker that only observes the feature vectors, but may have prior knowledge of the correlation between features and labels. We consider two such auditing measures: one additive, and one multiplicative. These incorporate previous approaches taken in the literature on empirical auditing and differential privacy. The measures allow us to place a variety of proposed privatization schemes -- some differentially private, some not -- on the same footing. We analyze these measures theoretically under a distributional model which encapsulates reasonable adversarial settings. We also quantify their behavior empirically on real and simulated prediction tasks. Across a range of experimental settings, we find that differentially private schemes dominate or match the privacy-utility tradeoff of more heuristic approaches.
Submitted 4 June, 2024;
originally announced June 2024.
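The abstract above names randomized response as one label privatization mechanism. The following is a minimal sketch of the standard binary version, not the paper's own code: the function names `keep_probability` and `randomized_response` are hypothetical, labels are assumed to be 0 or 1, and each label is kept with probability e^ε / (1 + e^ε) and flipped otherwise.

```python
import math
import random


def keep_probability(epsilon: float) -> float:
    """Probability of reporting the true binary label (standard randomized response)."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomized_response(labels, epsilon, rng=None):
    """Return a noisy copy of binary labels; each label is flipped independently.

    Assumption: labels are 0/1 integers. `rng` is an optional random.Random
    instance so that runs can be made reproducible.
    """
    rng = rng or random.Random()
    p_keep = keep_probability(epsilon)
    return [y if rng.random() < p_keep else 1 - y for y in labels]


if __name__ == "__main__":
    true_labels = [0, 1, 1, 0, 1]
    print(randomized_response(true_labels, epsilon=1.0, rng=random.Random(0)))
```

At ε = 0 each reported label is a fair coin flip, and as ε grows the reported labels approach the true ones, which is the privacy-utility tradeoff the auditing measures are meant to quantify.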
-
Better Private Linear Regression Through Better Private Feature Selection
Authors:
Travis Dick,
Jennifer Gillenwater,
Matthew Joseph
Abstract:
Existing work on differentially private linear regression typically assumes that end users can precisely set data bounds or algorithmic hyperparameters. End users often struggle to meet these requirements without directly examining the data (and violating privacy). Recent work has attempted to develop solutions that shift these burdens from users to algorithms, but they struggle to provide utility as the feature dimension grows. This work extends these algorithms to higher-dimensional problems by introducing a differentially private feature selection method based on Kendall rank correlation. We prove a utility guarantee for the setting where features are normally distributed and conduct experiments across 25 datasets. We find that adding this private feature selection step before regression significantly broadens the applicability of ``plug-and-play'' private linear regression algorithms at little additional cost to privacy, computation, or decision-making by the end user.
Submitted 1 June, 2023;
originally announced June 2023.
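The abstract above bases feature selection on Kendall rank correlation. The sketch below shows only the non-private core of that idea: score each feature by the absolute Kendall tau against the target and keep the top k. It adds no noise and is therefore not differentially private, and it is not the paper's algorithm. The names `kendall_tau` and `top_k_features` are hypothetical, and the tau-a variant (no tie correction) is an assumption.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a: (concordant pairs - discordant pairs) / total pairs."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    score = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            score += 1  # pair ordered the same way in x and y
        elif s < 0:
            score -= 1  # pair ordered in opposite ways
    return score / (n * (n - 1) / 2)


def top_k_features(columns, target, k):
    """Rank features (dict: name -> list of values) by |tau| with the target.

    Non-private illustration only: a private version would need to add
    calibrated randomness to this selection step.
    """
    ranked = sorted(
        columns,
        key=lambda name: abs(kendall_tau(columns[name], target)),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    features = {"a": [1, 2, 3, 4], "b": [1, 3, 2, 4], "c": [2, 1, 4, 3]}
    print(top_k_features(features, [1, 2, 3, 4], k=2))
```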
-
Measuring Re-identification Risk
Authors:
CJ Carey,
Travis Dick,
Alessandro Epasto,
Adel Javanmard,
Josh Karlin,
Shankar Kumar,
Andres Munoz Medina,
Vahab Mirrokni,
Gabriel Henrique Nunes,
Sergei Vassilvitskii,
Peilin Zhong
Abstract:
Compact user representations (such as embeddings) form the backbone of personalization services. In this work, we present a new theoretical framework to measure re-identification risk in such user representations. Our framework, based on hypothesis testing, formally bounds the probability that an attacker may be able to obtain the identity of a user from their representation. As an application, we show how our framework is general enough to model important real-world applications such as Chrome's Topics API for interest-based advertising. We complement our theoretical bounds by showing provably good attack algorithms for re-identification that we use to estimate the re-identification risk in the Topics API. We believe this work provides a rigorous and interpretable notion of re-identification risk and a framework to measure it that can be used to inform real-world applications.
Submitted 31 July, 2023; v1 submitted 12 April, 2023;
originally announced April 2023.
-
Subset-Based Instance Optimality in Private Estimation
Authors:
Travis Dick,
Alex Kulesza,
Ziteng Sun,
Ananda Theertha Suresh
Abstract:
We propose a new definition of instance optimality for differentially private estimation algorithms. Our definition requires an optimal algorithm to compete, simultaneously for every dataset $D$, with the best private benchmark algorithm that (a) knows $D$ in advance and (b) is evaluated by its worst-case performance on large subsets of $D$. That is, the benchmark algorithm need not perform well when potentially extreme points are added to $D$; it only has to handle the removal of a small number of real data points that already exist. This makes our benchmark significantly stronger than those proposed in prior work. We nevertheless show, for real-valued datasets, how to construct private algorithms that achieve our notion of instance optimality when estimating a broad class of dataset properties, including means, quantiles, and $\ell_p$-norm minimizers. For means in particular, we provide a detailed analysis and show that our algorithm simultaneously matches or exceeds the asymptotic performance of existing algorithms under a range of distributional assumptions.
Submitted 28 May, 2024; v1 submitted 1 March, 2023;
originally announced March 2023.
-
Easy Learning from Label Proportions
Authors:
Robert Istvan Busa-Fekete,
Heejin Choi,
Travis Dick,
Claudio Gentile,
Andres Munoz Medina
Abstract:
We consider the problem of Learning from Label Proportions (LLP), a weakly supervised classification setup where instances are grouped into "bags", and only the frequency of class labels at each bag is available. However, the objective of the learner is to achieve low task loss at an individual instance level. Here we propose Easyllp: a flexible and simple-to-implement debiasing approach based on aggregate labels, which operates on arbitrary loss functions. Our technique allows us to accurately estimate the expected loss of an arbitrary model at an individual level. We showcase the flexibility of our approach by applying it to popular learning frameworks, like Empirical Risk Minimization (ERM) and Stochastic Gradient Descent (SGD), with provable guarantees on instance-level performance. More concretely, we exhibit a variance reduction technique that makes the quality of LLP learning deteriorate only by a factor of k (k being bag size) in both ERM and SGD setups, as compared to full supervision. Finally, we validate our theoretical results on multiple datasets, demonstrating that our algorithm performs as well as or better than previous LLP approaches in spite of its simplicity.
Submitted 13 February, 2023; v1 submitted 6 February, 2023;
originally announced February 2023.
-
Confidence-Ranked Reconstruction of Census Microdata from Published Statistics
Authors:
Travis Dick,
Cynthia Dwork,
Michael Kearns,
Terrance Liu,
Aaron Roth,
Giuseppe Vietri,
Zhiwei Steven Wu
Abstract:
A reconstruction attack on a private dataset $D$ takes as input some publicly accessible information about the dataset and produces a list of candidate elements of $D$. We introduce a new class of data reconstruction attacks based on randomized methods for non-convex optimization. We empirically demonstrate that our attacks can not only reconstruct full rows of $D$ from aggregate query statistics $Q(D)\in \mathbb{R}^m$, but can do so in a way that reliably ranks reconstructed rows by their odds of appearing in the private data, providing a signature that could be used for prioritizing reconstructed rows for further actions such as identity theft or hate crime. We also design a sequence of baselines for evaluating reconstruction attacks. Our attacks significantly outperform those that are based only on access to a public distribution or population from which the private dataset $D$ was sampled, demonstrating that they are exploiting information in the aggregate statistics $Q(D)$, and not simply the overall structure of the distribution. In other words, the queries $Q(D)$ are permitting reconstruction of elements of this dataset, not the distribution from which $D$ was drawn. These findings are established both on 2010 U.S. decennial Census data and queries and Census-derived American Community Survey datasets. Taken together, our methods and experiments illustrate the risks in releasing numerically precise aggregate statistics of a large dataset, and provide further motivation for the careful application of provably private techniques such as differential privacy.
Submitted 6 February, 2023; v1 submitted 6 November, 2022;
originally announced November 2022.
-
Learning-Augmented Private Algorithms for Multiple Quantile Release
Authors:
Mikhail Khodak,
Kareem Amin,
Travis Dick,
Sergei Vassilvitskii
Abstract:
When applying differential privacy to sensitive data, we can often improve performance using external information such as other sensitive data, public data, or human priors. We propose to use the learning-augmented algorithms (or algorithms with predictions) framework -- previously applied largely to improve time complexity or competitive ratios -- as a powerful way of designing and analyzing privacy-preserving methods that can take advantage of such external information to improve utility. This idea is instantiated on the important task of multiple quantile release, for which we derive error guarantees that scale with a natural measure of prediction quality while (almost) recovering state-of-the-art prediction-independent guarantees. Our analysis enjoys several advantages, including minimal assumptions about the data, a natural way of adding robustness, and the provision of useful surrogate losses for two novel ``meta'' algorithms that learn predictions from other (potentially sensitive) data. We conclude with experiments on challenging tasks demonstrating that learning predictions across one or more instances can lead to large error reductions while preserving privacy.
Submitted 8 May, 2023; v1 submitted 20 October, 2022;
originally announced October 2022.
-
Comparing Unit Trains versus Manifest Trains for the Risk of Rail Transport of Hazardous Materials -- Part II: Application and Case Study
Authors:
Di Kang,
Jiaxi Zhao,
C. Tyler Dick,
Xiang Liu,
Zheyong Bian,
Steven W. Kirkpatrick,
Chen-Yu Lin
Abstract:
Built upon the risk analysis methodology (presented in the part I paper), this part II paper focuses on applying this methodology. Five illustrative scenarios were used to analyze the best or worst cases and compare the transportation risk differences between service options using unit trains and manifest trains. The comparison results indicate that placing all tank cars at the positions with the lowest probability of derailing and switching tank cars alone in classification yards could provide the lowest risk estimate given the same transportation demand (i.e., number of tank cars to transport). This paper also shows that, based on the data and parameters in the case study, risks during arrival/departure events and yard switching events could be as significant as risks on mainlines. This paper provides a way to use the risk analysis methodology for rail safety decisions. The methodology and its application can be tailored to specific infrastructure and rolling stock characteristics.
Submitted 4 July, 2022;
originally announced August 2022.
-
Comparing Unit Trains versus Manifest Trains for the Risk of Rail Transport of Hazardous Materials -- Part I: Risk Analysis Methodology
Authors:
Di Kang,
Jiaxi Zhao,
C. Tyler Dick,
Xiang Liu,
Zheyong Bian,
Steven W. Kirkpatrick,
Chen-Yu Lin
Abstract:
Transporting hazardous materials (hazmats) using tank cars has more significant economic benefits than other transportation modes. Although railway transportation is roughly four times more fuel-efficient than roadway transportation, a train derailment has greater potential to cause more disastrous consequences than a truck incident. Train types, such as unit train or manifest train (also called mixed train), can influence transport risks in several ways. For example, unit trains only experience risks on mainlines and when arriving at or departing from terminals, while manifest trains experience additional switching risks in yards. Based on prior studies and various data sources covering the years 1996-2018, this paper constructs event chains for line-haul risks on mainlines (for both unit trains and manifest trains), arrival/departure risks in terminals (for unit trains) and yards (for manifest trains), and yard switching risks for manifest trains using various probabilistic models, and finally determines expected casualties as the consequences of a potential train derailment and release incident. This is the first analysis to quantify the total risks a train may encounter throughout the shipment process, either on mainlines or in yards/terminals, distinguishing train types. It provides a methodology applicable to any train to calculate the expected risks (quantified as expected casualties in this paper) from an origin to a destination.
Submitted 4 July, 2022;
originally announced July 2022.
-
Combining Sobolev Smoothing with Parameterized Shape Optimization
Authors:
Thomas Dick,
Stephan Schmidt,
Nicolas R. Gauger
Abstract:
On the one hand, Sobolev gradient smoothing can considerably improve the performance of aerodynamic shape optimization and prevent issues with regularity. On the other hand, Sobolev smoothing can also be interpreted as an approximation for the shape Hessian. This paper demonstrates how Sobolev smoothing, interpreted as a shape Hessian approximation, offers considerable benefits, although the parameterization is already smooth in itself. Such an approach is especially beneficial in the context of simultaneous analysis and design, also called One Shot optimization, where we deal with inexact flow and adjoint solutions. Furthermore, the incorporation of the parameterization allows for direct application to engineering test cases, where shapes are always described by a CAD model. The new methodology presented in this paper is applied to reference test cases from aerodynamic shape optimization, and performance improvements in comparison to a classical Quasi-Newton scheme are shown.
Submitted 19 March, 2022; v1 submitted 30 September, 2021;
originally announced September 2021.
-
Scalable and Provably Accurate Algorithms for Differentially Private Distributed Decision Tree Learning
Authors:
Kaiwen Wang,
Travis Dick,
Maria-Florina Balcan
Abstract:
This paper introduces the first provably accurate algorithms for differentially private, top-down decision tree learning in the distributed setting (Balcan et al., 2012). We propose DP-TopDown, a general privacy preserving decision tree learning algorithm, and present two distributed implementations. Our first method NoisyCounts naturally extends the single machine algorithm by using the Laplace mechanism. Our second method LocalRNM significantly reduces communication and added noise by performing local optimization at each data holder. We provide the first utility guarantees for differentially private top-down decision tree learning in both the single machine and distributed settings. These guarantees show that the error of the privately-learned decision tree quickly goes to zero provided that the dataset is sufficiently large. Our extensive experiments on real datasets illustrate the trade-offs of privacy, accuracy and generalization when learning private decision trees in the distributed setting.
Submitted 22 February, 2021; v1 submitted 19 December, 2020;
originally announced December 2020.
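The abstract above says the NoisyCounts method uses the Laplace mechanism. As a rough illustration of that building block only, and not of the paper's distributed algorithm, the sketch below adds Laplace noise of scale 1/ε to a set of counts. The names `laplace_noise` and `noisy_counts` are hypothetical, and sensitivity 1 is an assumption (each record contributes to exactly one count, and neighbouring datasets differ by adding or removing one record).

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_counts(counts, epsilon, rng=None):
    """Add independent Laplace(1/epsilon) noise to each count.

    Assumption: the counts have sensitivity 1. Floating-point sampling like
    this is for illustration and is not hardened for production use.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    scale = 1.0 / epsilon
    return {key: value + laplace_noise(scale, rng) for key, value in counts.items()}


if __name__ == "__main__":
    # Hypothetical label counts at one candidate split of a decision tree node.
    split_counts = {"left_positive": 40, "left_negative": 10}
    print(noisy_counts(split_counts, epsilon=0.5, rng=random.Random(0)))
```

Smaller ε means a larger noise scale, so the counts used to score candidate splits become less accurate; this is one source of the privacy-accuracy trade-off the abstract mentions.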
-
Algorithms and Learning for Fair Portfolio Design
Authors:
Emily Diana,
Travis Dick,
Hadi Elzayn,
Michael Kearns,
Aaron Roth,
Zachary Schutzman,
Saeed Sharifi-Malvajerdi,
Juba Ziani
Abstract:
We consider a variation on the classical finance problem of optimal portfolio design. In our setting, a large population of consumers is drawn from some distribution over risk tolerances, and each consumer must be assigned to a portfolio of lower risk than her tolerance. The consumers may also belong to underlying groups (for instance, of demographic properties or wealth), and the goal is to design a small number of portfolios that are fair across groups in a particular and natural technical sense.
Our main results are algorithms for optimal and near-optimal portfolio design for both social welfare and fairness objectives, both with and without assumptions on the underlying group structure. We describe an efficient algorithm based on an internal two-player zero-sum game that learns near-optimal fair portfolios ex ante and show experimentally that it can be used to obtain a small set of fair portfolios ex post as well. For the special but natural case in which group structure coincides with risk tolerances (which models the reality that wealthy consumers generally tolerate greater risk), we give an efficient and optimal fair algorithm. We also provide generalization guarantees for the underlying risk distribution that have no dependence on the number of portfolios and illustrate the theory with simulation results.
Submitted 12 June, 2020;
originally announced June 2020.
-
Random Smoothing Might be Unable to Certify $\ell_\infty$ Robustness for High-Dimensional Images
Authors:
Avrim Blum,
Travis Dick,
Naren Manoj,
Hongyang Zhang
Abstract:
We show a hardness result for random smoothing to achieve certified adversarial robustness against attacks in the $\ell_p$ ball of radius $ε$ when $p>2$. Although random smoothing has been well understood for the $\ell_2$ case using the Gaussian distribution, much remains unknown concerning the existence of a noise distribution that works for the case of $p>2$. This has been posed as an open problem by Cohen et al. (2019) and includes many significant paradigms such as the $\ell_\infty$ threat model. In this work, we show that any noise distribution $\mathcal{D}$ over $\mathbb{R}^d$ that provides $\ell_p$ robustness for all base classifiers with $p>2$ must satisfy $\mathbb{E}η_i^2=Ω(d^{1-2/p}ε^2(1-δ)/δ^2)$ for 99% of the features (pixels) of vector $η\sim\mathcal{D}$, where $ε$ is the robust radius and $δ$ is the score gap between the highest-scored class and the runner-up. Therefore, for high-dimensional images with pixel values bounded in $[0,255]$, the required noise will eventually dominate the useful information in the images, leading to trivial smoothed classifiers.
Submitted 5 March, 2020; v1 submitted 9 February, 2020;
originally announced February 2020.
-
Mechanoradicals in tensed tendon collagen as a new source of oxidative stress
Authors:
Christopher Zapp,
Agnieszka Obarska-Kosinska,
Benedikt Rennekamp,
Davide Mercadante,
Uladzimir Barayeu,
Tobias P. Dick,
Vasyl Denysenkov,
Thomas Prisner,
Marina Bennati,
Csaba Daday,
Reinhard Kappl,
Frauke Gräter
Abstract:
As established nearly a century ago, mechanoradicals originate from homolytic bond scission in polymers. The existence, nature and biological relevance of mechanoradicals in proteins, instead, are unknown. We here show that mechanical stress on collagen produces radicals and subsequently reactive oxygen species, essential biological signaling molecules. Electron-paramagnetic resonance (EPR) spectroscopy of stretched rat tail tendon, atomistic Molecular Dynamics simulations and quantum calculations show that the radicals form by bond scission in the direct vicinity of crosslinks in collagen. Radicals migrate to adjacent clusters of aromatic residues and stabilize on oxidized tyrosyl radicals, giving rise to a distinct EPR spectrum consistent with a stable dihydroxyphenylalanine (DOPA) radical. The protein mechanoradicals, as a yet undiscovered source of oxidative stress, finally convert into hydrogen peroxide. Our study suggests collagen I to have evolved as a radical sponge against mechano-oxidative damage and proposes a new mechanism for exercise-induced oxidative stress and redox-mediated pathophysiological processes.
Submitted 27 October, 2019;
originally announced October 2019.
-
How much data is sufficient to learn high-performing algorithms? Generalization guarantees for data-driven algorithm design
Authors:
Maria-Florina Balcan,
Dan DeBlasio,
Travis Dick,
Carl Kingsford,
Tuomas Sandholm,
Ellen Vitercik
Abstract:
Algorithms often have tunable parameters that impact performance metrics such as runtime and solution quality. For many algorithms used in practice, no parameter settings admit meaningful worst-case bounds, so the parameters are made available for the user to tune. Alternatively, parameters may be tuned implicitly within the proof of a worst-case approximation ratio or runtime bound. Worst-case instances, however, may be rare or nonexistent in practice. A growing body of research has demonstrated that data-driven algorithm design can lead to significant improvements in performance. This approach uses a training set of problem instances sampled from an unknown, application-specific distribution and returns a parameter setting with strong average performance on the training set.
We provide a broadly applicable theory for deriving generalization guarantees that bound the difference between the algorithm's average performance over the training set and its expected performance. Our results apply no matter how the parameters are tuned, be it via an automated or manual approach. The challenge is that for many types of algorithms, performance is a volatile function of the parameters: slightly perturbing the parameters can cause large changes in behavior. Prior research has proved generalization bounds by employing case-by-case analyses of greedy algorithms, clustering algorithms, integer programming algorithms, and selling mechanisms. We uncover a unifying structure which we use to prove extremely general guarantees, yet we recover the bounds from prior research. Our guarantees apply whenever an algorithm's performance is a piecewise-constant, -linear, or -- more generally -- piecewise-structured function of its parameters. Our theory also implies novel bounds for voting mechanisms and dynamic programming algorithms from computational biology.
Submitted 25 April, 2021; v1 submitted 7 August, 2019;
originally announced August 2019.
-
Learning piecewise Lipschitz functions in changing environments
Authors:
Maria-Florina Balcan,
Travis Dick,
Dravyansh Sharma
Abstract:
Optimization in the presence of sharp (non-Lipschitz), unpredictable (w.r.t. time and amount) changes is a challenging and largely unexplored problem of great significance. We consider the class of piecewise Lipschitz functions, which is the most general online setting considered in the literature for the problem, and arises naturally in various combinatorial algorithm selection problems where utility functions can have sharp discontinuities. The usual performance metric of $\mathit{static}$ regret minimizes the gap between the payoff accumulated and that of the best fixed point for the entire duration, and thus fails to capture changing environments. Shifting regret is a useful alternative, which allows for up to $s$ environment shifts. In this work we provide an $O(\sqrt{sdT\log T}+sT^{1-β})$ regret bound for $β$-dispersed functions, where $β$ roughly quantifies the rate at which discontinuities appear in the utility functions in expectation (typically $β\ge1/2$ in problems of practical interest). We also present a lower bound tight up to sub-logarithmic factors. We further obtain improved bounds when selecting from a small pool of experts. We empirically demonstrate a key application of our algorithms to online clustering problems on popular benchmarks.
Submitted 6 August, 2020; v1 submitted 22 July, 2019;
originally announced July 2019.
-
Learning to Link
Authors:
Maria-Florina Balcan,
Travis Dick,
Manuel Lang
Abstract:
Clustering is an important part of many modern data analysis pipelines, including network analysis and data retrieval. There are many different clustering algorithms developed by various communities, and it is often not clear which algorithm will give the best performance on a specific clustering task. Similarly, we often have multiple ways to measure distances between data points, and the best clustering performance might require a non-trivial combination of those metrics. In this work, we study data-driven algorithm selection and metric learning for clustering problems, where the goal is to simultaneously learn the best algorithm and metric for a specific application. The family of clustering algorithms we consider consists of parameterized linkage-based procedures and includes single and complete linkage. The family of distance functions we learn over consists of convex combinations of base distance functions. We design efficient learning algorithms which receive samples from an application-specific distribution over clustering instances and simultaneously learn both a near-optimal distance and clustering algorithm from these classes. We also carry out a comprehensive empirical evaluation of our techniques showing that they can lead to significantly improved clustering performance.
Submitted 2 October, 2019; v1 submitted 1 July, 2019;
originally announced July 2019.
-
Semi-bandit Optimization in the Dispersed Setting
Authors:
Maria-Florina Balcan,
Travis Dick,
Wesley Pegden
Abstract:
The goal of data-driven algorithm design is to obtain high-performing algorithms for specific application domains using machine learning and data. Across many fields in AI, science, and engineering, practitioners will often fix a family of parameterized algorithms and then optimize those parameters to obtain good performance on example instances from the application domain. In the online setting, we must choose algorithm parameters for each instance as they arrive, and our goal is to be competitive with the best fixed algorithm in hindsight.
There are two major challenges in online data-driven algorithm design. First, it can be computationally expensive to evaluate the loss functions that map algorithm parameters to performance, which often require the learner to run a combinatorial algorithm to measure its performance. Second, the losses can be extremely volatile and have sharp discontinuities. However, we show that in many applications, evaluating the loss function for one algorithm choice can sometimes reveal the loss for a range of similar algorithms, essentially for free. We develop online optimization algorithms capable of using this kind of extra information by working in the semi-bandit feedback setting. Our algorithms achieve regret bounds that are essentially as good as algorithms under full-information feedback and are significantly more computationally efficient. We apply our semi-bandit results to obtain the first provable guarantees for data-driven algorithm design for linkage-based clustering and we improve the best regret bounds for designing greedy knapsack algorithms.
Submitted 21 December, 2020; v1 submitted 18 April, 2019;
originally announced April 2019.
-
Envy-Free Classification
Authors:
Maria-Florina Balcan,
Travis Dick,
Ritesh Noothigattu,
Ariel D. Procaccia
Abstract:
In classic fair division problems such as cake cutting and rent division, envy-freeness requires that each individual (weakly) prefer his allocation to anyone else's. On a conceptual level, we argue that envy-freeness also provides a compelling notion of fairness for classification tasks. Our technical focus is the generalizability of envy-free classification, i.e., understanding whether a classifier that is envy free on a sample would be almost envy free with respect to the underlying distribution with high probability. Our main result establishes that a small sample is sufficient to achieve such guarantees, when the classifier in question is a mixture of deterministic classifiers that belong to a family of low Natarajan dimension.
Submitted 24 September, 2020; v1 submitted 23 September, 2018;
originally announced September 2018.
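The envy-freeness condition in the abstract above (each individual weakly prefers their own outcome to anyone else's) can be checked directly on a finite sample. The following is a minimal sketch for a deterministic classifier; the utility matrix, the function name, and the example values are hypothetical and not taken from the paper.

```python
import numpy as np

def envy_pairs(utilities, outcomes):
    """Return all pairs (i, j) where individual i strictly prefers j's outcome.

    utilities[i, y] is individual i's utility for outcome y (hypothetical data);
    outcomes[i] is the outcome the classifier assigns to individual i.
    The assignment is envy-free on the sample exactly when no pair is returned.
    """
    utilities = np.asarray(utilities, dtype=float)
    outcomes = np.asarray(outcomes)
    own = utilities[np.arange(len(outcomes)), outcomes]  # utility for own outcome
    others = utilities[:, outcomes]  # others[i, j]: i's utility for j's outcome
    return [tuple(pair) for pair in np.argwhere(others > own[:, None])]

utilities = [[1.0, 0.0],   # individual 0 prefers outcome 0
             [0.0, 1.0]]   # individual 1 prefers outcome 1
print(envy_pairs(utilities, [0, 1]))  # everyone gets their preferred outcome
print(envy_pairs(utilities, [1, 0]))  # each individual envies the other
```

The abstract's generalization question is whether an empty result on a sample implies approximate envy-freeness on the underlying distribution; this sketch only covers the sample-level check.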
-
Data-Driven Clustering via Parameterized Lloyd's Families
Authors:
Maria-Florina Balcan,
Travis Dick,
Colin White
Abstract:
Clustering points in metric spaces is a long-studied area of research. Clustering has seen a multitude of work both theoretically, in understanding the approximation guarantees possible for many objective functions such as k-median and k-means clustering, and experimentally, in finding the fastest algorithms and seeding procedures for Lloyd's algorithm. The performance of a given clustering algorithm depends on the specific application at hand, and this may not be known up front. For example, a "typical instance" may vary depending on the application, and different clustering heuristics perform differently depending on the instance.
In this paper, we define an infinite family of algorithms generalizing Lloyd's algorithm, with one parameter controlling the initialization procedure, and another parameter controlling the local search procedure. This family of algorithms includes the celebrated k-means++ algorithm, as well as the classic farthest-first traversal algorithm. We design efficient learning algorithms which receive samples from an application-specific distribution over clustering instances and learn a near-optimal clustering algorithm from the class. We show the best parameters vary significantly across datasets such as MNIST, CIFAR, and mixtures of Gaussians. Our learned algorithms never perform worse than k-means++, and on some datasets we see significant improvements.
Submitted 24 May, 2019; v1 submitted 18 September, 2018;
originally announced September 2018.
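One natural way a single parameter can interpolate between uniform seeding, k-means++, and farthest-first traversal is to sample each new center with probability proportional to the distance to the nearest chosen center raised to a power `alpha`. The abstract above does not spell out the parameterization, so the sketch below is an assumption for illustration, and the function name and interface are hypothetical.

```python
import numpy as np

def d_alpha_seeding(points, k, alpha, rng):
    """Pick k center indices by sampling proportional to (nearest-center distance)**alpha.

    alpha = 0 gives uniform random seeding, alpha = 2 gives k-means++ style
    seeding, and alpha = inf gives farthest-first traversal.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)
    centers = [int(rng.integers(n))]  # first center chosen uniformly at random
    # dist[i]: distance from point i to its nearest chosen center.
    dist = np.linalg.norm(points - points[centers[0]], axis=1)
    for _ in range(1, k):
        if np.isinf(alpha):
            nxt = int(np.argmax(dist))  # deterministic farthest point
        else:
            weights = dist ** alpha
            total = weights.sum()
            # Fall back to uniform sampling if every point coincides with a center.
            probs = weights / total if total > 0 else np.full(n, 1.0 / n)
            nxt = int(rng.choice(n, p=probs))
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return centers

rng = np.random.default_rng(0)
points = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
print(d_alpha_seeding(points, k=2, alpha=2.0, rng=rng))
print(d_alpha_seeding(points, k=2, alpha=np.inf, rng=rng))
```

The seeds would then be handed to a local search phase such as Lloyd's iterations; the second parameter mentioned in the abstract, which controls that phase, is not modeled here.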
-
Learning to Branch
Authors:
Maria-Florina Balcan,
Travis Dick,
Tuomas Sandholm,
Ellen Vitercik
Abstract:
Tree search algorithms, such as branch-and-bound, are the most widely used tools for solving combinatorial and nonconvex problems. For example, they are the foremost method for solving (mixed) integer programs and constraint satisfaction problems. Tree search algorithms recursively partition the search space to find an optimal solution. In order to keep the tree size small, it is crucial to carefully decide, when expanding a tree node, which question (typically variable) to branch on at that node in order to partition the remaining space. Numerous partitioning techniques (e.g., variable selection) have been proposed, but there is no theory describing which technique is optimal. We show how to use machine learning to determine an optimal weighting of any set of partitioning procedures for the instance distribution at hand using samples from the distribution. We provide the first sample complexity guarantees for tree search algorithm configuration. These guarantees bound the number of samples sufficient to ensure that the empirical performance of an algorithm over the samples nearly matches its expected performance on the unknown instance distribution. This thorough theoretical investigation naturally gives rise to our learning algorithm. Via experiments, we show that learning an optimal weighting of partitioning procedures can dramatically reduce tree size, and we prove that this reduction can even be exponential. Through theory and experiments, we show that learning to branch is both practical and hugely beneficial.
Submitted 16 May, 2018; v1 submitted 27 March, 2018;
originally announced March 2018.
-
Dispersion for Data-Driven Algorithm Design, Online Learning, and Private Optimization
Authors:
Maria-Florina Balcan,
Travis Dick,
Ellen Vitercik
Abstract:
Data-driven algorithm design, that is, choosing the best algorithm for a specific application, is a crucial problem in modern data science. Practitioners often optimize over a parameterized algorithm family, tuning parameters based on problems from their domain. These procedures have historically come with no guarantees, though a recent line of work studies algorithm selection from a theoretical perspective. We advance the foundations of this field in several directions: we analyze online algorithm selection, where problems arrive one-by-one and the goal is to minimize regret, and private algorithm selection, where the goal is to find good parameters over a set of problems without revealing sensitive information contained therein. We study important algorithm families, including SDP-rounding schemes for problems formulated as integer quadratic programs, and greedy techniques for canonical subset selection problems. In these cases, the algorithm's performance is a volatile and piecewise Lipschitz function of its parameters, since tweaking the parameters can completely change the algorithm's behavior. We give a sufficient and general condition, dispersion, defining a family of piecewise Lipschitz functions that can be optimized online and privately, which includes the functions measuring the performance of the algorithms we study. Intuitively, a set of piecewise Lipschitz functions is dispersed if no small region contains many of the functions' discontinuities. We present general techniques for online and private optimization of the sum of dispersed piecewise Lipschitz functions. We improve over the best-known regret bounds for a variety of problems, prove regret bounds for problems not previously studied, and give matching lower bounds. We also give matching upper and lower bounds on the utility loss due to privacy. Moreover, we uncover dispersion in auction design and pricing problems.
Submitted 22 October, 2018; v1 submitted 8 November, 2017;
originally announced November 2017.
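The intuition for dispersion given in the abstract above (no small region contains many of the functions' discontinuities) can be made concrete for one-dimensional parameters. This is a simplified sketch under stated assumptions: each function is represented only by the hypothetical list of its discontinuity locations, and a "region" is a closed interval of a given width.

```python
def max_functions_discontinuous_in_window(discontinuities, width):
    """Largest number of functions with a discontinuity inside one interval.

    discontinuities[f] lists the discontinuity locations of function f on the
    real line (hypothetical input format). Only intervals [x, x + width] that
    start at a discontinuity are checked, since any interval can be shifted
    right to start at its first discontinuity without losing any.
    """
    events = sorted((x, f) for f, xs in enumerate(discontinuities) for x in xs)
    worst = 0
    for start, (left, _) in enumerate(events):
        seen = set()
        for x, f in events[start:]:
            if x - left > width:
                break
            seen.add(f)
        worst = max(worst, len(seen))
    return worst

# Four piecewise functions; three of them have a jump near 0.1.
jumps = [[0.10], [0.12], [0.50], [0.90, 0.11]]
print(max_functions_discontinuous_in_window(jumps, width=0.05))
print(max_functions_discontinuous_in_window(jumps, width=0.001))
```

A collection would count as well dispersed at a given width when this count stays small relative to the number of functions; the precise definition and thresholds are in the paper, not in this sketch.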
-
On ideal dynamic climbing ropes
Authors:
Davit Harutyunyan,
Graeme W. Milton,
Trevor J. Dick,
Justin Boyer
Abstract:
We consider the rope climber fall problem in two different settings. In the simplest formulation of the problem, the climber falls from a given altitude and is attached to one end of the rope, while the other end of the rope is attached to the rock at a given height. The problem is then to find the properties of the rope for which the peak force felt by the climber during the fall is minimal. The second problem we consider is again minimizing the same quantity, now in the presence of a carabiner. We will call such ropes \textit{mathematically ideal.} Given the height of the carabiner, the initial height and the mass of the climber, the length of the unstretched rope, and the distance between the belayer and the carabiner, we find the optimal (in the sense of minimizing the peak force to a given elongation) dynamic rope in the framework of nonlinear elasticity. Wires of shape memory materials have some of the desired features of the tension-strain relation of a mathematically ideal dynamic rope, namely a plateau in the tension over a range of strains. With a suitable hysteresis loop, they also absorb essentially all the energy from the fall, thus making them an ideal rope in this sense too.
Submitted 14 November, 2016;
originally announced November 2016.
-
Data Driven Resource Allocation for Distributed Learning
Authors:
Travis Dick,
Mu Li,
Venkata Krishna Pillutla,
Colin White,
Maria Florina Balcan,
Alex Smola
Abstract:
In distributed machine learning, data is dispatched to multiple machines for processing. Motivated by the fact that similar data points often belong to the same or similar classes, and more generally, classification rules of high accuracy tend to be "locally simple but globally complex" (Vapnik & Bottou 1993), we propose data dependent dispatching that takes advantage of such structure. We present an in-depth analysis of this model, providing new algorithms with provable worst-case guarantees, analysis proving existing scalable heuristics perform well in natural non worst-case conditions, and techniques for extending a dispatching rule from a small sample to the entire distribution. We overcome novel technical challenges to satisfy important conditions for accurate distributed learning, including fault tolerance and balancedness. We empirically compare our approach with baselines based on random partitioning, balanced partition trees, and locality sensitive hashing, showing that we achieve significantly higher accuracy on both synthetic and real world image and advertising datasets. We also demonstrate that our technique strongly scales with the available computing power.
Submitted 15 December, 2016; v1 submitted 15 December, 2015;
originally announced December 2015.
-
Label Efficient Learning by Exploiting Multi-class Output Codes
Authors:
Maria Florina Balcan,
Travis Dick,
Yishay Mansour
Abstract:
We present a new perspective on the popular multi-class algorithmic techniques of one-vs-all and error correcting output codes. Rather than studying the behavior of these techniques for supervised learning, we establish a connection between the success of these methods and the existence of label-efficient learning procedures. We show that in both the realizable and agnostic cases, if output codes are successful at learning from labeled data, they implicitly assume structure on how the classes are related. By making that structure explicit, we design learning algorithms to recover the classes with low label complexity. We provide results for the commonly studied cases of one-vs-all learning and when the codewords of the classes are well separated. We additionally consider the more challenging case where the codewords are not well separated, but satisfy a boundary features condition that captures the natural intuition that every bit of the codewords should be significant.
Submitted 25 November, 2016; v1 submitted 10 November, 2015;
originally announced November 2015.