-
Privacy-Preserving Collaborative Prediction using Random Forests
Authors:
Irene Giacomelli,
Somesh Jha,
Ross Kleiman,
David Page,
Kyonghwan Yoon
Abstract:
We study the problem of privacy-preserving machine learning (PPML) for ensemble methods, focusing our effort on random forests. In collaborative analysis, PPML attempts to resolve the conflict between the need for data sharing and privacy. This is especially important in privacy-sensitive applications such as learning predictive models for clinical decision support from EHR data from different clinics, where each clinic is responsible for its patients' privacy. We propose a new approach for ensemble methods: each entity learns a model from its own data, and when a client asks for the prediction on a new private instance, the answers from all the locally trained models are used to compute the prediction in such a way that no extra information is revealed. We implement this approach for random forests and demonstrate its high efficiency and potential accuracy benefit via experiments on real-world datasets, including actual EHR data.
Submitted 21 November, 2018;
originally announced November 2018.
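The aggregation step described in the abstract above (each entity trains locally, and the local answers are combined into one prediction) can be sketched as follows. This is only a plaintext illustration under stated assumptions: the names `make_stump` and `collaborative_predict` are hypothetical, the threshold stumps stand in for locally trained random forests, and the privacy-preserving computation that hides the individual answers is deliberately omitted, since the abstract does not specify it.

```python
from collections import Counter
from typing import Callable, List, Sequence

# A "model" here is any function mapping a feature vector to a class label.
Model = Callable[[Sequence[float]], int]


def make_stump(feature: int, threshold: float) -> Model:
    """Hypothetical stand-in for a model trained locally by one entity."""
    def predict(x: Sequence[float]) -> int:
        return 1 if x[feature] > threshold else 0
    return predict


def collaborative_predict(local_models: List[Model], x: Sequence[float]) -> int:
    """Majority vote over the answers of all locally trained models.

    Assumption: in a privacy-preserving protocol the individual votes
    would not be visible in the clear as they are here.
    """
    votes = Counter(model(x) for model in local_models)
    best = max(votes.values())
    # Break ties deterministically toward the smallest label.
    return min(label for label, count in votes.items() if count == best)


# Three entities (e.g. clinics), each holding its own local model.
clinic_models = [make_stump(0, 0.5), make_stump(1, 2.0), make_stump(0, 0.8)]
prediction = collaborative_predict(clinic_models, [0.6, 3.0])
```

Only the final label is returned to the caller; a real deployment would additionally need to ensure that neither the client's instance nor the per-entity votes leak.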
-
Exploring Connections Between Active Learning and Model Extraction
Authors:
Varun Chandrasekaran,
Kamalika Chaudhuri,
Irene Giacomelli,
Somesh Jha,
Songbai Yan
Abstract:
Machine learning is being increasingly used by individuals, research institutions, and corporations. This has resulted in a surge of Machine Learning-as-a-Service (MLaaS): cloud services that provide (a) tools and resources to learn the model, and (b) a user-friendly query interface to access the model. However, such MLaaS systems raise privacy concerns such as model extraction. In model extraction attacks, adversaries maliciously exploit the query interface to steal the model. More precisely, in a model extraction attack, a good approximation of a sensitive or proprietary model held by the server is extracted (i.e., learned) by a dishonest user who interacts with the server only via the query interface. This attack was introduced by Tramer et al. at the 2016 USENIX Security Symposium, where practical attacks for various models were shown. We believe that better understanding the efficacy of model extraction attacks is paramount to designing secure MLaaS systems. To that end, we take the first step by (a) formalizing model extraction and discussing possible defense strategies, and (b) drawing parallels between model extraction and the established area of active learning. In particular, we show that recent advancements in the active learning domain can be used to implement powerful model extraction attacks, and investigate possible defense strategies.
Submitted 19 November, 2019; v1 submitted 5 November, 2018;
originally announced November 2018.
-
Privacy Risk in Machine Learning: Analyzing the Connection to Overfitting
Authors:
Samuel Yeom,
Irene Giacomelli,
Matt Fredrikson,
Somesh Jha
Abstract:
Machine learning algorithms, when applied to sensitive data, pose a distinct threat to privacy. A growing body of prior work demonstrates that models produced by these algorithms may leak specific private information in the training data to an attacker, either through the models' structure or their observable behavior. However, the underlying cause of this privacy risk is not well understood beyond a handful of anecdotal accounts that suggest overfitting and influence might play a role.
This paper examines the effect that overfitting and influence have on the ability of an attacker to learn information about the training data from machine learning models, either through training set membership inference or attribute inference attacks. Using both formal and empirical analyses, we illustrate a clear relationship between these factors and the privacy risk that arises in several popular machine learning algorithms. We find that overfitting is sufficient to allow an attacker to perform membership inference and, when the target attribute meets certain conditions about its influence, attribute inference attacks. Interestingly, our formal analysis also shows that overfitting is not necessary for these attacks and begins to shed light on what other factors may be in play. Finally, we explore the connection between membership inference and attribute inference, showing that there are deep connections between the two that lead to effective new attacks.
Submitted 4 May, 2018; v1 submitted 5 September, 2017;
originally announced September 2017.
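To make the link between overfitting and membership inference in the abstract above concrete, here is a minimal sketch of one simple style of attack: guess "member" whenever the model's loss on a record is at most a threshold. The abstract does not spell out the attack, so the decision rule, the choice of threshold (mean training loss), the function names, and the toy probabilities are all assumptions made for illustration.

```python
import math
from typing import List


def cross_entropy(prob_true_label: float) -> float:
    """Per-example loss given the probability the model assigns to the true label."""
    return -math.log(prob_true_label)


def infer_membership(loss: float, threshold: float) -> bool:
    """Hypothetical attack rule: low loss suggests the record was in the training set."""
    return loss <= threshold


def membership_advantage(member_losses: List[float],
                         nonmember_losses: List[float],
                         threshold: float) -> float:
    """True-positive rate minus false-positive rate of the threshold attack."""
    tpr = sum(infer_membership(l, threshold) for l in member_losses) / len(member_losses)
    fpr = sum(infer_membership(l, threshold) for l in nonmember_losses) / len(nonmember_losses)
    return tpr - fpr


# Toy numbers (made up): an overfit model is more confident on training records.
member_losses = [cross_entropy(p) for p in (0.99, 0.95, 0.90, 0.60)]
nonmember_losses = [cross_entropy(p) for p in (0.70, 0.50, 0.40, 0.97)]

threshold = sum(member_losses) / len(member_losses)  # assumed: mean training loss
advantage = membership_advantage(member_losses, nonmember_losses, threshold)
```

With these toy values the attack flags three of four members and one of four non-members. The wider the gap between training and held-out loss, the better such a rule separates the two groups, which is the intuition behind tying privacy risk to overfitting.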
-
Improved Decoding Algorithms for Reed-Solomon Codes
Authors:
Irene Giacomelli
Abstract:
In coding theory, Reed-Solomon codes are one of the best-known and most widely used classes of error-correcting codes. In this thesis we study and compare two major strategies for decoding them, the Peterson-Gorenstein-Zierler (PGZ) and the Berlekamp-Massey (BM) decoders, in order to improve existing decoding algorithms and propose faster new ones. In particular, we study a modified version of the PGZ decoder, which we call the fast Peterson-Gorenstein-Zierler (fPGZ) decoding algorithm. This improvement, presented in 1997, exploits the Hankel structure of the syndrome matrix. In this thesis we show that the fPGZ decoding algorithm can be seen as a particular case of the BM one. Indeed, we prove that the intermediate outcomes obtained in the implementation of fPGZ are a subset of those of the BM decoding algorithm. In this way, we also uncover the relationship between the leading principal minors of the syndrome matrix and the discrepancies computed by the BM algorithm. Finally, thanks to the study of the structure of the syndrome matrix and its leading principal minors, we improve the error value computation in both decoding strategies (specifically, we prove new error value formulas for the fPGZ and the BM decoding algorithms), and we state a new iterative formulation of the PGZ decoder well suited to a parallel implementation on integrated microchips. Thus, using techniques of linear algebra, we obtain a parallel decoding algorithm for Reed-Solomon codes with O(e) computational time complexity, where e is the number of errors that occurred, although a fairly large number of elementary circuit elements is needed.
Submitted 8 October, 2013;
originally announced October 2013.