-
Revisiting Model-Agnostic Private Learning: Faster Rates and Active Learning
Authors:
Chong Liu,
Yuqing Zhu,
Kamalika Chaudhuri,
Yu-Xiang Wang
Abstract:
The Private Aggregation of Teacher Ensembles (PATE) framework is one of the most promising recent approaches in differentially private learning. Existing theoretical analysis shows that PATE consistently learns any VC class in the realizable setting, but falls short in explaining its success in more general cases where the error rate of the optimal classifier is bounded away from zero. We fill this gap by introducing the Tsybakov Noise Condition (TNC) and establishing stronger and more interpretable learning bounds. These bounds provide new insights into when PATE works and improve over existing results even in the narrower realizable setting. We also investigate the compelling idea of using active learning to save privacy budget, and empirical studies show the effectiveness of this new idea. The novel components in the proofs include a more refined analysis of the majority voting classifier, which could be of independent interest, and an observation that the synthetic "student" learning problem is nearly realizable by construction under the Tsybakov noise condition.
Submitted 11 March, 2022; v1 submitted 5 November, 2020;
originally announced November 2020.
-
Multitask Bandit Learning Through Heterogeneous Feedback Aggregation
Authors:
Zhi Wang,
Chicheng Zhang,
Manish Kumar Singh,
Laurel D. Riek,
Kamalika Chaudhuri
Abstract:
In many real-world applications, multiple agents seek to learn how to perform highly related yet slightly different tasks in an online bandit learning protocol. We formulate this problem as the $ε$-multi-player multi-armed bandit problem, in which a set of players concurrently interact with a set of arms, and for each arm, the reward distributions for all players are similar but not necessarily identical. We develop an upper confidence bound-based algorithm, RobustAgg$(ε)$, that adaptively aggregates rewards collected by different players. In the setting where an upper bound on the pairwise similarities of reward distributions between players is known, we achieve instance-dependent regret guarantees that depend on the amenability of information sharing across players. We complement these upper bounds with nearly matching lower bounds. In the setting where pairwise similarities are unknown, we provide a lower bound, as well as an algorithm that trades off minimax regret guarantees for adaptivity to unknown similarity structure.
Submitted 19 July, 2021; v1 submitted 29 October, 2020;
originally announced October 2020.
-
Locally Differentially Private Analysis of Graph Statistics
Authors:
Jacob Imola,
Takao Murakami,
Kamalika Chaudhuri
Abstract:
Differentially private analysis of graphs is widely used for releasing statistics from sensitive graphs while still preserving user privacy. Most existing algorithms, however, are in a centralized privacy model, where a trusted data curator holds the entire graph. As this model raises a number of privacy and security issues, such as the trustworthiness of the curator and the possibility of data breaches, it is desirable to consider algorithms in a more decentralized local model where no server holds the entire graph.
In this work, we consider a local model, and present algorithms for counting subgraphs -- a fundamental task for analyzing the connection patterns in a graph -- with LDP (Local Differential Privacy). For triangle counts, we present algorithms that use one and two rounds of interaction, and show that an additional round can significantly improve the utility. For $k$-star counts, we present an algorithm that achieves an order-optimal estimation error in the non-interactive local model. We provide new lower bounds on the estimation error for general graph statistics including triangle counts and $k$-star counts. Finally, we perform extensive experiments on two real datasets, and show that it is indeed possible to accurately estimate subgraph counts in the local differential privacy model.
Submitted 11 February, 2021; v1 submitted 16 October, 2020;
originally announced October 2020.
-
Trustworthy AI Inference Systems: An Industry Research View
Authors:
Rosario Cammarota,
Matthias Schunter,
Anand Rajan,
Fabian Boemer,
Ágnes Kiss,
Amos Treiber,
Christian Weinert,
Thomas Schneider,
Emmanuel Stapf,
Ahmad-Reza Sadeghi,
Daniel Demmler,
Joshua Stock,
Huili Chen,
Siam Umar Hussain,
Sadegh Riazi,
Farinaz Koushanfar,
Saransh Gupta,
Tajan Simunic Rosing,
Kamalika Chaudhuri,
Hamid Nejatollahi,
Nikil Dutt,
Mohsen Imani,
Kim Laine,
Anuj Dubey,
Aydin Aysu
, et al. (4 additional authors not shown)
Abstract:
In this work, we provide an industry research view for approaching the design, deployment, and operation of trustworthy Artificial Intelligence (AI) inference systems. Such systems provide customers with timely, informed, and customized inferences to aid their decision, while at the same time utilizing appropriate security protection mechanisms for AI models. Additionally, such systems should also use Privacy-Enhancing Technologies (PETs) to protect customers' data at any time. To approach the subject, we start by introducing current trends in AI inference systems. We continue by elaborating on the relationship between Intellectual Property (IP) and private data protection in such systems. Regarding the protection mechanisms, we survey the security and privacy building blocks instrumental in designing, building, deploying, and operating private AI inference systems. For example, we highlight opportunities and challenges in AI systems using trusted execution environments combined with more recent advances in cryptographic techniques to protect data in use. Finally, we outline areas of further development that require the global collective attention of industry, academia, and government researchers to sustain the operation of trustworthy AI inference systems.
Submitted 10 February, 2023; v1 submitted 10 August, 2020;
originally announced August 2020.
-
Single and multi-mode directional lasing from arrays of dielectric nanoresonators
Authors:
Shaimaa I. Azzam,
Krishnakali Chaudhuri,
Alexei Lagutchev,
Zubin Jacob,
Young L. Kim,
Vladimir M. Shalaev,
Alexandra Boltasseva,
Alexander V. Kildishev
Abstract:
The strong electric and magnetic resonances in dielectric subwavelength structures have enabled unique opportunities for efficient manipulation of light-matter interactions. Besides, the dramatic enhancement of nonlinear light-matter interactions near so-called bound states in the continuum (BICs) has recently attracted enormous attention due to potential advancements in all-optical and quantum computing. However, the experimental realizations and the applications of high-Q factor resonances in dielectric resonances in the visible regime have thus far been considerably limited. In this work, we explore the interplay of electric and magnetic dipoles in arrays of dielectric nanoresonators to enhance light-matter interaction. We report on the experimental realization of high-Q factor resonances in the visible through the collective diffractive coupling of electric and magnetic dipoles. Providing direct physical insights, we also show that coupling the Rayleigh anomaly of the array with the electric and magnetic dipoles of the individual nanoresonators can result in the formation of different types of BICs. We utilize the resonances in the visible regime to achieve lasing action at room temperature with high spatial directionality and low threshold. Finally, we experimentally demonstrate multi-mode, directional lasing and study the BIC-assisted lasing mode engineering in arrays of dielectric nanoresonators. We believe that our results enable a new range of applications in flat photonics through realizing on-chip controllable single and multi-wavelength micro-lasers.
Submitted 29 June, 2020;
originally announced June 2020.
-
The Expressive Power of a Class of Normalizing Flow Models
Authors:
Zhifeng Kong,
Kamalika Chaudhuri
Abstract:
Normalizing flows have received a great deal of recent attention as they allow flexible generative modeling as well as easy likelihood computation. While a wide variety of flow models have been proposed, there is little formal understanding of the representation power of these models. In this work, we study some basic normalizing flows and rigorously establish bounds on their expressive power. Our results indicate that while these flows are highly expressive in one dimension, in higher dimensions their representation power may be limited, especially when the flows have moderate depth.
Submitted 30 May, 2020;
originally announced June 2020.
-
Successive Refinement of Privacy
Authors:
Antonious M. Girgis,
Deepesh Data,
Kamalika Chaudhuri,
Christina Fragouli,
Suhas Diggavi
Abstract:
This work examines a novel question: how much randomness is needed to achieve local differential privacy (LDP)? A motivating scenario is providing multiple levels of privacy to multiple analysts, either for distribution or for heavy-hitter estimation, using the same (randomized) output. We call this setting successive refinement of privacy, as it provides hierarchical access to the raw data with different privacy levels. For example, the same randomized output could enable one analyst to reconstruct the input, while another can only estimate the distribution subject to LDP requirements. This extends the classical Shannon (wiretap) security setting to local differential privacy. We provide (order-wise) tight characterizations of privacy-utility-randomness trade-offs in several cases for distribution estimation, including the standard LDP setting under a randomness constraint. We also provide a non-trivial privacy mechanism for multi-level privacy. Furthermore, we show that we cannot reuse random keys over time while preserving privacy of each user.
Submitted 24 May, 2020;
originally announced May 2020.
-
Detecting Parkinsonian Tremor from IMU Data Collected In-The-Wild using Deep Multiple-Instance Learning
Authors:
Alexandros Papadopoulos,
Konstantinos Kyritsis,
Lisa Klingelhoefer,
Sevasti Bostanjopoulou,
K. Ray Chaudhuri,
Anastasios Delopoulos
Abstract:
Parkinson's Disease (PD) is a slowly evolving neurological disease that affects about 1% of the population above 60 years old, causing symptoms that are subtle at first, but whose intensity increases as the disease progresses. Automated detection of these symptoms could offer clues as to the early onset of the disease, thus improving the expected clinical outcomes of the patients via appropriately targeted interventions. This potential has led many researchers to develop methods that use widely available sensors to measure and quantify the presence of PD symptoms such as tremor, rigidity and bradykinesia. However, most of these approaches operate under controlled settings, such as in the lab or at home, thus limiting their applicability under free-living conditions. In this work, we present a method for automatically identifying tremorous episodes related to PD, based on IMU signals captured via a smartphone device. We propose a Multiple-Instance Learning approach, wherein a subject is represented as an unordered bag of accelerometer signal segments and a single, expert-provided, tremor annotation. Our method combines deep feature learning with a learnable pooling stage that is able to identify key instances within the subject bag, while still being trainable end-to-end. We validate our algorithm on a newly introduced dataset of 45 subjects, containing accelerometer signals collected entirely in-the-wild. The good classification performance obtained in the conducted experiments suggests that the proposed method can efficiently navigate the noisy environment of in-the-wild recordings.
Submitted 6 May, 2020;
originally announced May 2020.
-
A Non-Parametric Test to Detect Data-Copying in Generative Models
Authors:
Casey Meehan,
Kamalika Chaudhuri,
Sanjoy Dasgupta
Abstract:
Detecting overfitting in generative models is an important challenge in machine learning. In this work, we formalize a form of overfitting that we call data-copying -- where the generative model memorizes and outputs training samples or small variations thereof. We provide a three-sample non-parametric test for detecting data-copying that uses the training set, a separate sample from the target distribution, and a generated sample from the model, and study the performance of our test on several canonical models and datasets.
For code and examples, visit https://github.com/casey-meehan/data-copying
Submitted 12 April, 2020;
originally announced April 2020.
-
When are Non-Parametric Methods Robust?
Authors:
Robi Bhattacharjee,
Kamalika Chaudhuri
Abstract:
A growing body of research has shown that many classifiers are susceptible to adversarial examples -- small strategic modifications to test inputs that lead to misclassification. In this work, we study general non-parametric methods, with a view towards understanding when they are robust to these modifications. We establish general conditions under which non-parametric methods are r-consistent -- in the sense that they converge to optimally robust and accurate classifiers in the large sample limit.
Concretely, our results show that when data is well-separated, nearest neighbors and kernel classifiers are r-consistent, while histograms are not. For general data distributions, we prove that preprocessing by Adversarial Pruning (Yang et al., 2019) -- that makes data well-separated -- followed by nearest neighbors or kernel classifiers also leads to r-consistency.
Submitted 28 December, 2020; v1 submitted 13 March, 2020;
originally announced March 2020.
-
A Closer Look at Accuracy vs. Robustness
Authors:
Yao-Yuan Yang,
Cyrus Rashtchian,
Hongyang Zhang,
Ruslan Salakhutdinov,
Kamalika Chaudhuri
Abstract:
Current methods for training robust networks lead to a drop in test accuracy, which has led prior works to posit that a robustness-accuracy tradeoff may be inevitable in deep learning. We take a closer look at this phenomenon and first show that real image datasets are actually separated. With this property in mind, we then prove that robustness and accuracy should both be achievable for benchmark datasets through locally Lipschitz functions, and hence, there should be no inherent tradeoff between robustness and accuracy. Through extensive experiments with robustness methods, we argue that the gap between theory and practice arises from two limitations of current methods: either they fail to impose local Lipschitzness or they are insufficiently generalized. We explore combining dropout with robust training methods and obtain better generalization. We conclude that achieving robustness and accuracy in practice may require using methods that impose local Lipschitzness and augmenting them with deep learning generalization techniques. Code available at https://github.com/yangarbiter/robust-local-lipschitz
Submitted 12 July, 2020; v1 submitted 5 March, 2020;
originally announced March 2020.
-
Approximate Data Deletion from Machine Learning Models
Authors:
Zachary Izzo,
Mary Anne Smart,
Kamalika Chaudhuri,
James Zou
Abstract:
Deleting data from a trained machine learning (ML) model is a critical task in many applications. For example, we may want to remove the influence of training points that might be out of date or outliers. Regulations such as the EU's General Data Protection Regulation also stipulate that individuals can request to have their data deleted. The naive approach to data deletion is to retrain the ML model on the remaining data, but this is too time-consuming. In this work, we propose a new approximate deletion method for linear and logistic models whose computational cost is linear in the feature dimension $d$ and independent of the number of training data $n$. This is a significant gain over all existing methods, which all have superlinear time dependence on the dimension. We also develop a new feature-injection test to evaluate the thoroughness of data deletion from ML models.
Submitted 23 February, 2021; v1 submitted 24 February, 2020;
originally announced February 2020.
-
Location Trace Privacy Under Conditional Priors
Authors:
Casey Meehan,
Kamalika Chaudhuri
Abstract:
Providing meaningful privacy to users of location based services is particularly challenging when multiple locations are revealed in a short period of time. This is primarily due to the tremendous degree of dependence that can be anticipated between points. We propose a Rényi differentially private framework for bounding expected privacy loss for conditionally dependent data. Additionally, we demonstrate an algorithm for achieving this privacy under Gaussian process conditional priors. This framework both exemplifies why conditionally dependent data is so challenging to protect and offers a strategy for preserving privacy to within a fixed radius for every user location in a trace.
Submitted 9 December, 2019;
originally announced December 2019.
-
Capacity Bounded Differential Privacy
Authors:
Kamalika Chaudhuri,
Jacob Imola,
Ashwin Machanavajjhala
Abstract:
Differential privacy, a notion of algorithmic stability, is a gold standard for measuring the additional risk an algorithm's output poses to the privacy of a single record in the dataset. Differential privacy is defined as the distance between the output distribution of an algorithm on neighboring datasets that differ in one entry. In this work, we present a novel relaxation of differential privacy, capacity bounded differential privacy, where the adversary that distinguishes output distributions is assumed to be capacity-bounded -- i.e. bounded not in computational power, but in terms of the function class from which their attack algorithm is drawn. We model adversaries in terms of restricted f-divergences between probability distributions, and study properties of the definition and algorithms that satisfy them.
Submitted 3 July, 2019;
originally announced July 2019.
-
Robustness for Non-Parametric Classification: A Generic Attack and Defense
Authors:
Yao-Yuan Yang,
Cyrus Rashtchian,
Yizhen Wang,
Kamalika Chaudhuri
Abstract:
Adversarially robust machine learning has received much recent attention. However, prior attacks and defenses for non-parametric classifiers have been developed on an ad hoc or classifier-specific basis. In this work, we take a holistic look at adversarial examples for non-parametric classifiers, including nearest neighbors, decision trees, and random forests. We provide a general defense method, adversarial pruning, that works by preprocessing the dataset to become well-separated. To test our defense, we provide a novel attack that applies to a wide range of non-parametric classifiers. Theoretically, we derive an optimally robust classifier, which is analogous to the Bayes Optimal. We show that adversarial pruning can be viewed as a finite sample approximation to this optimal classifier. We empirically show that our defense and attack are either better than or competitive with prior work on non-parametric classifiers. Overall, our results provide a strong and broadly-applicable baseline for future work on robust non-parametrics. Code available at https://github.com/yangarbiter/adversarial-nonparametrics/ .
Submitted 24 February, 2020; v1 submitted 7 June, 2019;
originally announced June 2019.
-
The Label Complexity of Active Learning from Observational Data
Authors:
Songbai Yan,
Kamalika Chaudhuri,
Tara Javidi
Abstract:
Counterfactual learning from observational data involves learning a classifier on an entire population based on data that is observed conditioned on a selection policy. This work considers this problem in an active setting, where the learner additionally has access to unlabeled examples and can choose to get a subset of these labeled by an oracle.
Prior work on this problem uses disagreement-based active learning, along with an importance weighted loss estimator to account for counterfactuals, which leads to a high label complexity. We show how to instead incorporate a more efficient counterfactual risk minimizer into the active learning algorithm. This requires us to modify both the counterfactual risk to make it amenable to active learning, as well as the active learning process to make it amenable to the risk. We provably demonstrate that the result of this is an algorithm which is statistically consistent as well as more label-efficient than prior work.
Submitted 27 October, 2019; v1 submitted 29 May, 2019;
originally announced May 2019.
-
An Investigation of Data Poisoning Defenses for Online Learning
Authors:
Yizhen Wang,
Somesh Jha,
Kamalika Chaudhuri
Abstract:
Data poisoning attacks -- where an adversary can modify a small fraction of training data, with the goal of forcing the trained classifier to high loss -- are an important threat for machine learning in many applications. While a body of prior work has developed attacks and defenses, there is not much general understanding on when various attacks and defenses are effective. In this work, we undertake a rigorous study of defenses against data poisoning for online learning. First, we study four standard defenses in a powerful threat model, and provide conditions under which they can allow or resist rapid poisoning. We then consider a weaker and more realistic threat model, and show that the success of the adversary in the presence of data poisoning defenses there depends on the "ease" of the learning problem.
Submitted 19 February, 2020; v1 submitted 28 May, 2019;
originally announced May 2019.
-
Profile-Based Privacy for Locally Private Computations
Authors:
Joseph Geumlek,
Kamalika Chaudhuri
Abstract:
Differential privacy has emerged as a gold standard in privacy-preserving data analysis. A popular variant is local differential privacy, where the data holder is the trusted curator. A major barrier, however, towards a wider adoption of this model is that it offers a poor privacy-utility tradeoff.
In this work, we address this problem by introducing a new variant of local privacy called profile-based privacy. The central idea is that the problem setting comes with a graph G of data generating distributions, whose edges encode sensitive pairs of distributions that should be made indistinguishable. This provides higher utility because unlike local differential privacy, we no longer need to make every pair of private values in the domain indistinguishable, and instead only protect the identity of the underlying distribution. We establish privacy properties of the profile-based privacy definition, such as post-processing invariance and graceful composition. Finally, we provide mechanisms that are private in this framework, and show via simulations that they achieve higher utility than the corresponding local differential privacy mechanisms.
Submitted 16 June, 2019; v1 submitted 20 January, 2019;
originally announced March 2019.
-
Exploring Connections Between Active Learning and Model Extraction
Authors:
Varun Chandrasekaran,
Kamalika Chaudhuri,
Irene Giacomelli,
Somesh Jha,
Songbai Yan
Abstract:
Machine learning is being increasingly used by individuals, research institutions, and corporations. This has resulted in the surge of Machine Learning-as-a-Service (MLaaS) - cloud services that provide (a) tools and resources to learn the model, and (b) a user-friendly query interface to access the model. However, such MLaaS systems raise privacy concerns such as model extraction. In model extraction attacks, adversaries maliciously exploit the query interface to steal the model. More precisely, in a model extraction attack, a good approximation of a sensitive or proprietary model held by the server is extracted (i.e. learned) by a dishonest user who interacts with the server only via the query interface. This attack was introduced by Tramer et al. at the 2016 USENIX Security Symposium, where practical attacks for various models were shown. We believe that better understanding the efficacy of model extraction attacks is paramount to designing secure MLaaS systems. To that end, we take the first step by (a) formalizing model extraction and discussing possible defense strategies, and (b) drawing parallels between model extraction and the established area of active learning. In particular, we show that recent advancements in the active learning domain can be used to implement powerful model extraction attacks, and investigate possible defense strategies.
Submitted 19 November, 2019; v1 submitted 5 November, 2018;
originally announced November 2018.
-
The Inductive Bias of Restricted f-GANs
Authors:
Shuang Liu,
Kamalika Chaudhuri
Abstract:
Generative adversarial networks are a novel method for statistical inference that have achieved much empirical success; however, the factors contributing to this success remain ill-understood. In this work, we attempt to analyze generative adversarial learning -- that is, statistical inference as the result of a game between a generator and a discriminator -- with the view of understanding how it differs from classical statistical inference solutions such as maximum likelihood inference and the method of moments.
Specifically, we provide a theoretical characterization of the distribution inferred by a simple form of generative adversarial learning called restricted f-GANs -- where the discriminator is a function in a given function class, the distribution induced by the generator is restricted to lie in a pre-specified distribution class and the objective is similar to a variational form of the f-divergence. A consequence of our result is that for linear KL-GANs -- that is, when the discriminator is a linear function over some feature space and f corresponds to the KL-divergence -- the distribution induced by the optimal generator is neither the maximum likelihood nor the method of moments solution, but an interesting combination of both.
Submitted 12 September, 2018;
originally announced September 2018.
-
Differentially Private Continual Release of Graph Statistics
Authors:
Shuang Song,
Susan Little,
Sanjay Mehta,
Staal Vinterbo,
Kamalika Chaudhuri
Abstract:
Motivated by understanding the dynamics of sensitive social networks over time, we consider the problem of continual release of statistics in a network that arrives online, while preserving privacy of its participants. For our privacy notion, we use differential privacy -- the gold standard in privacy for statistical data analysis. The main challenge in this problem is maintaining a good privacy-utility tradeoff; naive solutions that compose across time, as well as solutions suited to tabular data either lead to poor utility or do not directly apply. In this work, we show that if there is a publicly known upper bound on the maximum degree of any node in the entire network sequence, then we can release many common graph statistics such as degree distributions and subgraph counts continually with a better privacy-accuracy tradeoff.
Code available at https://bitbucket.org/shs037/graphprivacycode
Submitted 18 September, 2018; v1 submitted 7 September, 2018;
originally announced September 2018.
-
Data Poisoning Attacks against Online Learning
Authors:
Yizhen Wang,
Kamalika Chaudhuri
Abstract:
We consider data poisoning attacks, a class of adversarial attacks on machine learning where an adversary has the power to alter a small fraction of the training data in order to make the trained classifier satisfy certain objectives. While there has been much prior work on data poisoning, most of it is in the offline setting, and attacks for online learning, where training data arrives in a streaming manner, are not well understood.
In this work, we initiate a systematic investigation of data poisoning attacks for online learning. We formalize the problem into two settings, and we propose a general attack strategy, formulated as an optimization problem, that applies to both with some modifications. We propose three solution strategies, and perform extensive experimental evaluation. Finally, we discuss the implications of our findings for building successful defenses.
Submitted 27 August, 2018;
originally announced August 2018.
-
Active Learning with Logged Data
Authors:
Songbai Yan,
Kamalika Chaudhuri,
Tara Javidi
Abstract:
We consider active learning with logged data, where labeled examples are drawn conditioned on a predetermined logging policy, and the goal is to learn a classifier on the entire population, not just conditioned on the logging policy. Prior work addresses this problem either when only logged data is available, or purely in a controlled random experimentation setting where the logged data is ignored. In this work, we combine both approaches to provide an algorithm that uses logged data to bootstrap and inform experimentation, thus achieving the best of both worlds. Our work is inspired by a connection between controlled random experimentation and active learning, and modifies existing disagreement-based active learning algorithms to exploit logged data.
Submitted 13 June, 2018; v1 submitted 25 February, 2018;
originally announced February 2018.
-
Spectral Learning of Binomial HMMs for DNA Methylation Data
Authors:
Chicheng Zhang,
Eran A. Mukamel,
Kamalika Chaudhuri
Abstract:
We consider learning parameters of Binomial Hidden Markov Models, which may be used to model DNA methylation data. The standard algorithm for the problem is EM, which is computationally expensive for sequences of the scale of the mammalian genome. Recently developed spectral algorithms can learn parameters of latent variable models via tensor decomposition, and are highly efficient for large data. However, these methods have only been applied to categorical HMMs, and the main challenge is how to extend them to Binomial HMMs while still retaining computational efficiency. We address this challenge by introducing a new feature-map based approach that exploits specific properties of Binomial HMMs. We provide theoretical performance guarantees for our algorithm and evaluate it on real DNA methylation data.
Submitted 7 February, 2018;
originally announced February 2018.
-
Rényi Differential Privacy Mechanisms for Posterior Sampling
Authors:
Joseph Geumlek,
Shuang Song,
Kamalika Chaudhuri
Abstract:
Using a recently proposed privacy definition of Rényi Differential Privacy (RDP), we re-examine the inherent privacy of releasing a single sample from a posterior distribution. We exploit the impact of the prior distribution in mitigating the influence of individual data points. In particular, we focus on sampling from an exponential family and specific generalized linear models, such as logistic regression. We propose novel RDP mechanisms as well as offering a new RDP analysis for an existing method in order to add value to the RDP framework. Each method is capable of achieving arbitrary RDP privacy guarantees, and we offer experimental results of their efficacy.
Submitted 2 October, 2017;
originally announced October 2017.
-
Einstein's Patents and Inventions
Authors:
Asis Kumar Chaudhuri
Abstract:
Time magazine selected Albert Einstein, the German-born Jewish scientist, as the person of the 20th century. Undoubtedly, the 20th century was the age of science, and Einstein's contributions to unraveling the mysteries of nature were unparalleled. However, few are aware that Einstein was also a great inventor. He and his collaborators patented a wide variety of inventions in several countries. After a brief description of Einstein's life, his collaborators, his inventions, and his patents are discussed.
Submitted 5 September, 2017; v1 submitted 3 September, 2017;
originally announced September 2017.
-
Learning to Blame: Localizing Novice Type Errors with Data-Driven Diagnosis
Authors:
Eric L. Seidel,
Huma Sibghat,
Kamalika Chaudhuri,
Westley Weimer,
Ranjit Jhala
Abstract:
Localizing type errors is challenging in languages with global type inference, as the type checker must make assumptions about what the programmer intended to do. We introduce Nate, a data-driven approach to error localization based on supervised learning. Nate analyzes a large corpus of training data -- pairs of ill-typed programs and their "fixed" versions -- to automatically learn a model of where the error is most likely to be found. Given a new ill-typed program, Nate executes the model to generate a list of potential blame assignments ranked by likelihood. We evaluate Nate by comparing its precision to the state of the art on a set of over 5,000 ill-typed OCaml programs drawn from two instances of an introductory programming course. We show that when the top-ranked blame assignment is considered, Nate's data-driven model is able to correctly predict the exact sub-expression that should be changed 72% of the time, 28 points higher than OCaml and 16 points higher than the state-of-the-art SHErrLoc tool. Furthermore, Nate's accuracy surpasses 85% when we consider the top two locations and reaches 91% if we consider the top three.
Submitted 17 September, 2017; v1 submitted 24 August, 2017;
originally announced August 2017.
-
Composition Properties of Inferential Privacy for Time-Series Data
Authors:
Shuang Song,
Kamalika Chaudhuri
Abstract:
With the proliferation of mobile devices and the internet of things, developing principled solutions for privacy in time series applications has become increasingly important. While differential privacy is the gold standard for database privacy, many time series applications require a different kind of guarantee, and a number of recent works have used some form of inferential privacy to address these situations.
However, a major barrier to using inferential privacy in practice is its lack of graceful composition -- even if the same or related sensitive data is used in multiple releases that are safe individually, the combined release may have poor privacy properties. In this paper, we study composition properties of a form of inferential privacy called Pufferfish when applied to time-series data. We show that while general Pufferfish mechanisms may not compose gracefully, a specific Pufferfish mechanism, called the Markov Quilt Mechanism, which was recently introduced, has strong composition properties comparable to that of pure differential privacy when applied to time series data.
Submitted 10 July, 2017;
originally announced July 2017.
-
Analyzing the Robustness of Nearest Neighbors to Adversarial Examples
Authors:
Yizhen Wang,
Somesh Jha,
Kamalika Chaudhuri
Abstract:
Motivated by safety-critical applications, test-time attacks on classifiers via adversarial examples has recently received a great deal of attention. However, there is a general lack of understanding on why adversarial examples arise; whether they originate due to inherent properties of data or due to lack of training samples remains ill-understood. In this work, we introduce a theoretical framework analogous to bias-variance theory for understanding these effects.
We use our framework to analyze the robustness of a canonical non-parametric classifier - the k-nearest neighbors. Our analysis shows that its robustness properties depend critically on the value of k - the classifier may be inherently non-robust for small k, but its robustness approaches that of the Bayes Optimal classifier for fast-growing k. We propose a novel modified 1-nearest neighbor classifier, and guarantee its robustness in the large sample limit. Our experiments suggest that this classifier may have good robustness properties even for reasonable data set sizes.
Submitted 18 June, 2019; v1 submitted 13 June, 2017;
originally announced June 2017.
-
Approximation and Convergence Properties of Generative Adversarial Learning
Authors:
Shuang Liu,
Olivier Bousquet,
Kamalika Chaudhuri
Abstract:
Generative adversarial networks (GAN) approximate a target data distribution by jointly optimizing an objective function through a "two-player game" between a generator and a discriminator. Despite their empirical success, however, two very basic questions on how well they can approximate the target distribution remain unanswered. First, it is not known how restricting the discriminator family affects the approximation quality. Second, while a number of different objective functions have been proposed, we do not understand when convergence to the global minima of the objective function leads to convergence to the target distribution under various notions of distributional convergence.
In this paper, we address these questions in a broad and unified setting by defining a notion of adversarial divergences that includes a number of recently proposed objective functions. We show that if the objective function is an adversarial divergence with some additional conditions, then using a restricted discriminator family has a moment-matching effect. Additionally, we show that for objective functions that are strict adversarial divergences, convergence in the objective function implies weak convergence, thus generalizing previous results.
Submitted 24 May, 2017;
originally announced May 2017.
-
Variational Bayes In Private Settings (VIPS)
Authors:
Mijung Park,
James Foulds,
Kamalika Chaudhuri,
Max Welling
Abstract:
Many applications of Bayesian data analysis involve sensitive information, motivating methods which ensure that privacy is protected. We introduce a general privacy-preserving framework for Variational Bayes (VB), a widely used optimization-based Bayesian inference method. Our framework respects differential privacy, the gold-standard privacy criterion, and encompasses a large class of probabilistic models, called the Conjugate Exponential (CE) family. We observe that we can straightforwardly privatise VB's approximate posterior distributions for models in the CE family, by perturbing the expected sufficient statistics of the complete-data likelihood. For a broadly-used class of non-CE models, those with binomial likelihoods, we show how to bring such models into the CE family, such that inferences in the modified model resemble the private variational Bayes algorithm as closely as possible, using the Polya-Gamma data augmentation scheme. The iterative nature of variational Bayes presents a further challenge since iterations increase the amount of noise needed. We overcome this by combining: (1) an improved composition method for differential privacy, called the moments accountant, which provides a tight bound on the privacy cost of multiple VB iterations and thus significantly decreases the amount of additive noise; and (2) the privacy amplification effect of subsampling mini-batches from large-scale data in stochastic learning. We empirically demonstrate the effectiveness of our method in CE and non-CE models including latent Dirichlet allocation, Bayesian logistic regression, and sigmoid belief networks, evaluated on real-world datasets.
Submitted 3 December, 2018; v1 submitted 1 November, 2016;
originally announced November 2016.
-
Active Learning from Imperfect Labelers
Authors:
Songbai Yan,
Kamalika Chaudhuri,
Tara Javidi
Abstract:
We study active learning where the labeler can not only return incorrect labels but also abstain from labeling. We consider different noise and abstention conditions of the labeler. We propose an algorithm which utilizes abstention responses, and analyze its statistical consistency and query complexity under fairly natural assumptions on the noise and abstention rate of the labeler. This algorithm is adaptive in a sense that it can automatically request less queries with a more informed or less noisy labeler. We couple our algorithm with lower bounds to show that under some technical conditions, it achieves nearly optimal query complexity.
Submitted 30 October, 2016;
originally announced October 2016.
-
Private Topic Modeling
Authors:
Mijung Park,
James Foulds,
Kamalika Chaudhuri,
Max Welling
Abstract:
We develop a privatised stochastic variational inference method for Latent Dirichlet Allocation (LDA). The iterative nature of stochastic variational inference presents challenges: multiple iterations are required to obtain accurate posterior distributions, yet each iteration increases the amount of noise that must be added to achieve a reasonable degree of privacy. We propose a practical algorithm that overcomes this challenge by combining: (1) an improved composition method for differential privacy, called the moments accountant, which provides a tight bound on the privacy cost of multiple variational inference iterations and thus significantly decreases the amount of additive noise; and (2) privacy amplification resulting from subsampling of large-scale data. Focusing on conjugate exponential family models, in our private variational inference, all the posterior distributions will be privatised by simply perturbing expected sufficient statistics. Using Wikipedia data, we illustrate the effectiveness of our algorithm for large-scale data.
Submitted 3 December, 2018; v1 submitted 13 September, 2016;
originally announced September 2016.
-
Bolt-on Differential Privacy for Scalable Stochastic Gradient Descent-based Analytics
Authors:
Xi Wu,
Fengan Li,
Arun Kumar,
Kamalika Chaudhuri,
Somesh Jha,
Jeffrey F. Naughton
Abstract:
While significant progress has been made separately on analytics systems for scalable stochastic gradient descent (SGD) and private SGD, none of the major scalable analytics frameworks have incorporated differentially private SGD. There are two inter-related issues for this disconnect between research and practice: (1) low model accuracy due to added noise to guarantee privacy, and (2) high development and runtime overhead of the private algorithms. This paper takes a first step to remedy this disconnect and proposes a private SGD algorithm to address \emph{both} issues in an integrated manner. In contrast to the white-box approach adopted by previous work, we revisit and use the classical technique of {\em output perturbation} to devise a novel "bolt-on" approach to private SGD. While our approach trivially addresses (2), it makes (1) even more challenging. We address this challenge by providing a novel analysis of the $L_2$-sensitivity of SGD, which allows, under the same privacy guarantees, better convergence of SGD when only a constant number of passes can be made over the data. We integrate our algorithm, as well as other state-of-the-art differentially private SGD, into Bismarck, a popular scalable SGD-based analytics system on top of an RDBMS. Extensive experiments show that our algorithm can be easily integrated, incurs virtually no overhead, scales well, and most importantly, yields substantially better (up to 4X) test accuracy than the state-of-the-art algorithms on many real datasets.
Submitted 23 March, 2017; v1 submitted 15 June, 2016;
originally announced June 2016.
-
On Science, pseudoscience and String theory
Authors:
Asis Kumar Chaudhuri
Abstract:
The article discusses the demarcation problem; how to distinguish between science and pseudoscience. It then examines the string theory under various demarcation criteria to conclude that string theory cannot be considered as science.
Submitted 14 June, 2016;
originally announced June 2016.
-
DP-EM: Differentially Private Expectation Maximization
Authors:
Mijung Park,
Jimmy Foulds,
Kamalika Chaudhuri,
Max Welling
Abstract:
The iterative nature of the expectation maximization (EM) algorithm presents a challenge for privacy-preserving estimation, as each iteration increases the amount of noise needed. We propose a practical private EM algorithm that overcomes this challenge using two innovations: (1) a novel moment perturbation formulation for differentially private EM (DP-EM), and (2) the use of two recently developed composition methods to bound the privacy "cost" of multiple EM iterations: the moments accountant (MA) and zero-mean concentrated differential privacy (zCDP). Both MA and zCDP bound the moment generating function of the privacy loss random variable and achieve a refined tail bound, which effectively decrease the amount of additive noise. We present empirical results showing the benefits of our approach, as well as similar performance between these two composition methods in the DP-EM setting for Gaussian mixture models. Our approach can be readily extended to many iterative learning algorithms, opening up various exciting future directions.
Submitted 31 October, 2016; v1 submitted 23 May, 2016;
originally announced May 2016.
-
Gravitational Wave for a pedestrian
Authors:
A K Chaudhuri
Abstract:
The physics of gravitational wave and its detection in the recent experiment by the LIGO collaboration is discussed in simple terms for a general audience. The main article is devoid of any mathematics, but an appendix is included for inquisitive readers where essential mathematics for general theory of relativity and gravitational waves are given.
Submitted 5 May, 2016; v1 submitted 3 May, 2016;
originally announced May 2016.
-
The Extended Littlestone's Dimension for Learning with Mistakes and Abstentions
Authors:
Chicheng Zhang,
Kamalika Chaudhuri
Abstract:
This paper studies classification with an abstention option in the online setting. In this setting, examples arrive sequentially, the learner is given a hypothesis class $\mathcal H$, and the goal of the learner is to either predict a label on each example or abstain, while ensuring that it does not make more than a pre-specified number of mistakes when it does predict a label.
Previous work on this problem has left open two main challenges. First, not much is known about the optimality of algorithms, and in particular, about what an optimal algorithmic strategy is for any individual hypothesis class. Second, while the realizable case has been studied, the more realistic non-realizable scenario is not well-understood. In this paper, we address both challenges. First, we provide a novel measure, called the Extended Littlestone's Dimension, which captures the number of abstentions needed to ensure a certain number of mistakes. Second, we explore the non-realizable case, and provide upper and lower bounds on the number of abstentions required by an algorithm to guarantee a specified number of mistakes.
Submitted 28 September, 2016; v1 submitted 20 April, 2016;
originally announced April 2016.
-
On the Theory and Practice of Privacy-Preserving Bayesian Data Analysis
Authors:
James Foulds,
Joseph Geumlek,
Max Welling,
Kamalika Chaudhuri
Abstract:
Bayesian inference has great promise for the privacy-preserving analysis of sensitive data, as posterior sampling automatically preserves differential privacy, an algorithmic notion of data privacy, under certain conditions (Dimitrakakis et al., 2014; Wang et al., 2015). While this one posterior sample (OPS) approach elegantly provides privacy "for free," it is data inefficient in the sense of asymptotic relative efficiency (ARE). We show that a simple alternative based on the Laplace mechanism, the workhorse of differential privacy, is as asymptotically efficient as non-private posterior inference, under general assumptions. This technique also has practical advantages including efficient use of the privacy budget for MCMC. We demonstrate the practicality of our approach on a time-series analysis of sensitive military records from the Afghanistan and Iraq wars disclosed by the Wikileaks organization.
Submitted 8 June, 2016; v1 submitted 23 March, 2016;
originally announced March 2016.
-
Pufferfish Privacy Mechanisms for Correlated Data
Authors:
Shuang Song,
Yizhen Wang,
Kamalika Chaudhuri
Abstract:
Many modern databases include personal and sensitive correlated data, such as private information on users connected together in a social network, and measurements of physical activity of single subjects across time. However, differential privacy, the current gold standard in data privacy, does not adequately address privacy issues in this kind of data.
This work looks at a recent generalization of differential privacy, called Pufferfish, that can be used to address privacy in correlated data. The main challenge in applying Pufferfish is a lack of suitable mechanisms. We provide the first mechanism -- the Wasserstein Mechanism -- which applies to any general Pufferfish framework. Since this mechanism may be computationally inefficient, we provide an additional mechanism that applies to some practical cases such as physical activity measurements across time, and is computationally efficient. Our experimental evaluations indicate that this mechanism provides privacy and utility for synthetic as well as real data in two separate domains.
Submitted 12 March, 2017; v1 submitted 12 March, 2016;
originally announced March 2016.
-
A Hybrid Linear Logic for Constrained Transition Systems
Authors:
Joelle Despeyroux,
Kaustuv Chaudhuri
Abstract:
Linear implication can represent state transitions, but real transition systems operate under temporal, stochastic or probabilistic constraints that are not directly representable in ordinary linear logic. We propose a general modal extension of intuitionistic linear logic where logical truth is indexed by constraints and hybrid connectives combine constraint reasoning with logical reasoning. The logic has a focused cut-free sequent calculus that can be used to internalize the rules of particular constrained transition systems; we illustrate this with an adequate encoding of the synchronous stochastic pi-calculus.
Submitted 8 March, 2016;
originally announced March 2016.
-
Convex Optimization For Non-Convex Problems via Column Generation
Authors:
Julian Yarkony,
Kamalika Chaudhuri
Abstract:
We apply column generation to approximating complex structured objects via a set of primitive structured objects under either the cross entropy or L2 loss. We use L1 regularization to encourage the use of few structured primitive objects. We attack approximation using convex optimization over an infinite number of variables each corresponding to a primitive structured object that are generated on demand by easy inference in the Lagrangian dual. We apply our approach to producing low rank approximations to large 3-way tensors.
Submitted 13 February, 2016;
originally announced February 2016.
-
Much ado about Zero
Authors:
Asis Kumar Chaudhuri
Abstract:
A brief historical introduction for the enigmatic number Zero is given. The discussions are for popular consumption.
Submitted 7 June, 2016; v1 submitted 22 January, 2016;
originally announced January 2016.
-
Active Learning from Weak and Strong Labelers
Authors:
Chicheng Zhang,
Kamalika Chaudhuri
Abstract:
An active learner is given a hypothesis class, a large set of unlabeled examples and the ability to interactively query labels to an oracle of a subset of these examples; the goal of the learner is to learn a hypothesis in the class that fits the data well by making as few label queries as possible.
This work addresses active learning with labels obtained from strong and weak labelers, where in addition to the standard active learning setting, we have an extra weak labeler which may occasionally provide incorrect labels. An example is learning to classify medical images where either expensive labels may be obtained from a physician (oracle or strong labeler), or cheaper but occasionally incorrect labels may be obtained from a medical resident (weak labeler). Our goal is to learn a classifier with low error on data labeled by the oracle, while using the weak labeler to reduce the number of label queries made to this labeler. We provide an active learning algorithm for this setting, establish its statistical consistency, and analyze its label complexity to characterize when it can provide label savings over using the strong labeler alone.
Submitted 15 October, 2015; v1 submitted 9 October, 2015;
originally announced October 2015.
-
Proceedings Tenth International Workshop on Logical Frameworks and Meta Languages: Theory and Practice
Authors:
Iliano Cervesato,
Kaustuv Chaudhuri
Abstract:
This volume constitutes the proceedings of LFMTP 2015, the Tenth International Workshop on Logical Frameworks and Meta-Languages: Theory and Practice, held on August 1st, 2015 in Berlin, Germany. The workshop was a one-day satellite event of CADE-25, the 25th International Conference on Automated Deduction. Logical frameworks and meta-languages form a common substrate for representing, implementing, and reasoning about a wide variety of deductive systems of interest in logic and computer science. Their design and implementation and their use in reasoning tasks ranging from the correctness of software to the properties of formal computational systems have been the focus of considerable research over the last two decades. This workshop brought together designers, implementors, and practitioners to discuss various aspects impinging on the structure and utility of logical frameworks, including the treatment of variable binding, inductive and co-inductive reasoning techniques and the expressiveness and lucidity of the reasoning process.
Submitted 27 July, 2015;
originally announced July 2015.
-
Fluctuations in slope parameter in event-by-event hydrodynamics and momentum anisotropy in heavy ion collisions
Authors:
A. K. Chaudhuri
Abstract:
In an event-by-event hydrodynamic model, we have simulated 30-40\% Au+Au collisions at RHIC and computed the slope parameter from the invariant pion distribution. In each event, the slope parameter fluctuates azimuthally. The Fourier expansion coefficients $T_n$ for the slope parameter and the Fourier expansion coefficients $v_n$ for the azimuthal distribution $\frac{dN}{dφ}$ are found to be strongly correlated. The strong correlation between the two sets of expansion coefficients suggests that, in addition to the azimuthal distribution, fluctuations in the slope parameter of the invariant distribution can also be used to study the final-state momentum anisotropy in relativistic heavy ion collisions. If measured experimentally, they can serve as an additional constraint for hydrodynamical modeling.
△ Less
Submitted 17 July, 2015;
originally announced July 2015.
-
Convergence Rates of Active Learning for Maximum Likelihood Estimation
Authors:
Kamalika Chaudhuri,
Sham Kakade,
Praneeth Netrapalli,
Sujay Sanghavi
Abstract:
An active learner is given a class of models, a large set of unlabeled examples, and the ability to interactively query labels of a subset of these examples; the goal of the learner is to learn a model in the class that fits the data well.
Previous theoretical work has rigorously characterized label complexity of active learning, but most of this work has focused on the PAC or the agnostic PAC model. In this paper, we shift our attention to a more general setting -- maximum likelihood estimation. Provided certain conditions hold on the model class, we provide a two-stage active learning algorithm for this problem. The conditions we require are fairly general, and cover the widely popular class of Generalized Linear Models, which in turn, include models for binary and multi-class classification, regression, and conditional random fields.
We provide an upper bound on the label requirement of our algorithm, and a lower bound that matches it up to lower order terms. Our analysis shows that unlike binary classification in the realizable case, just a single extra round of interaction is sufficient to achieve near-optimal performance in maximum likelihood estimation. On the empirical side, the recent work in ~\cite{Zhang12} and~\cite{Zhang14} (on active linear and logistic regression) shows the promise of this approach.
Submitted 8 June, 2015;
originally announced June 2015.
-
Spectral Learning of Large Structured HMMs for Comparative Epigenomics
Authors:
Chicheng Zhang,
Jimin Song,
Kevin C Chen,
Kamalika Chaudhuri
Abstract:
We develop a latent variable model and an efficient spectral algorithm motivated by the recent emergence of very large data sets of chromatin marks from multiple human cell types. A natural model for chromatin data in one cell type is a Hidden Markov Model (HMM); we model the relationship between multiple cell types by connecting their hidden states by a fixed tree of known structure. The main challenge with learning parameters of such models is that iterative methods such as EM are very slow, while naive spectral methods result in time and space complexity exponential in the number of cell types. We exploit properties of the tree structure of the hidden states to provide spectral algorithms that are more computationally efficient for current biological datasets. We provide sample complexity bounds for our algorithm and evaluate it experimentally on biological data from nine human cell types. Finally, we show that beyond our specific model, some of our algorithmic ideas can be applied to other graphical models.
Submitted 4 June, 2015;
originally announced June 2015.
-
Crowdsourcing Feature Discovery via Adaptively Chosen Comparisons
Authors:
James Y. Zou,
Kamalika Chaudhuri,
Adam Tauman Kalai
Abstract:
We introduce an unsupervised approach to efficiently discover the underlying features in a data set via crowdsourcing. Our queries ask crowd members to articulate a feature common to two out of three displayed examples. In addition we also ask the crowd to provide binary labels to the remaining examples based on the discovered features. The triples are chosen adaptively based on the labels of the previously discovered features on the data set. In two natural models of features, hierarchical and independent, we show that a simple adaptive algorithm, using "two-out-of-three" similarity queries, recovers all features with less labor than any nonadaptive algorithm. Experimental results validate the theoretical findings.
Submitted 31 March, 2015;
originally announced April 2015.
-
Undecidability of Multiplicative Subexponential Logic
Authors:
Kaustuv Chaudhuri
Abstract:
Subexponential logic is a variant of linear logic with a family of exponential connectives--called subexponentials--that are indexed and arranged in a pre-order. Each subexponential has or lacks associated structural properties of weakening and contraction. We show that classical propositional multiplicative linear logic extended with one unrestricted and two incomparable linear subexponentials can encode the halting problem for two register Minsky machines, and is hence undecidable.
Submitted 16 February, 2015;
originally announced February 2015.