-
Privacy-Preserving Federated Learning over Vertically and Horizontally Partitioned Data for Financial Anomaly Detection
Authors:
Swanand Ravindra Kadhe,
Heiko Ludwig,
Nathalie Baracaldo,
Alan King,
Yi Zhou,
Keith Houck,
Ambrish Rawat,
Mark Purcell,
Naoise Holohan,
Mikio Takeuchi,
Ryo Kawahara,
Nir Drucker,
Hayim Shaul,
Eyal Kushnir,
Omri Soceanu
Abstract:
The effective detection of evidence of financial anomalies requires collaboration among multiple entities who own a diverse set of data, such as a payment network system (PNS) and its partner banks. Trust among these financial institutions is limited by regulation and competition. Federated learning (FL) enables entities to collaboratively train a model when data is either vertically or horizontally partitioned across the entities. However, in real-world financial anomaly detection scenarios, the data is partitioned both vertically and horizontally and hence it is not possible to use existing FL approaches in a plug-and-play manner.
Our novel solution, PV4FAD, combines fully homomorphic encryption (HE), secure multi-party computation (SMPC), differential privacy (DP), and randomization techniques to balance privacy and accuracy during training and to prevent inference threats at model deployment time. Our solution provides input privacy through HE and SMPC, and output privacy against inference time attacks through DP. Specifically, we show that, in the honest-but-curious threat model, banks do not learn any sensitive features about PNS transactions, and the PNS does not learn any information about the banks' dataset but only learns prediction labels. We also develop and analyze a DP mechanism to protect output privacy during inference. Our solution generates high-utility models by significantly reducing the per-bank noise level while satisfying distributed DP. To ensure high accuracy, our approach produces an ensemble model, in particular, a random forest. This enables us to take advantage of the well-known properties of ensembles to reduce variance and increase accuracy. Our solution won second prize in the first phase of the U.S. Privacy Enhancing Technologies (PETs) Prize Challenge.
Submitted 30 October, 2023;
originally announced October 2023.
-
Random Number Generators and Seeding for Differential Privacy
Authors:
Naoise Holohan
Abstract:
Differential Privacy (DP) relies on random numbers to preserve privacy, typically utilising Pseudorandom Number Generators (PRNGs) as a source of randomness. In order to allow for consistent reproducibility, testing and bug-fixing in DP algorithms and results, it is important to allow for the seeding of the PRNGs used therein. In this work, we examine the landscape of Random Number Generators (RNGs), and the considerations software engineers should make when choosing and seeding a PRNG for DP. We hope it serves as a suitable guide for DP practitioners, and includes many lessons learned when implementing seeding for diffprivlib.
Submitted 7 July, 2023;
originally announced July 2023.
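As an illustration of the seeding concern raised in this abstract, the sketch below shows one way a noise-adding routine might accept an injectable generator so that results can be reproduced in tests. The function names `laplace_noise` and `noisy_count` are hypothetical and are not taken from diffprivlib; the Laplace draw is built from two exponential variates in Python's standard `random` module. A seeded Mersenne Twister is predictable, so the seeded path here is assumed to be for testing and debugging only.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng=None):
    """Hypothetical helper: add Laplace noise to a count query.

    A count has sensitivity 1, so the noise scale is 1 / epsilon.
    If no generator is supplied, fall back to OS entropy, which cannot
    be seeded and is therefore not reproducible.
    """
    if rng is None:
        rng = random.SystemRandom()
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Two generators built from the same seed produce identical outputs,
# which is what makes a failing test or bug report repeatable.
first = noisy_count(100, 0.5, random.Random(42))
second = noisy_count(100, 0.5, random.Random(42))
print(first == second)
```

Passing the generator in as an argument, rather than seeding a global one, keeps the seeded and unseeded paths explicit at each call site.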
-
Secure k-Anonymization over Encrypted Databases
Authors:
Manish Kesarwani,
Akshar Kaul,
Stefano Braghin,
Naoise Holohan,
Spiros Antonatos
Abstract:
Data protection algorithms are becoming increasingly important to support modern business needs for facilitating data sharing and data monetization. Anonymization is an important step before data sharing. Several organizations rely on third parties for storing and managing data. However, third parties are often not trusted to store plaintext personal and sensitive data; data encryption is widely adopted to protect against intentional and unintentional attempts to read personal/sensitive data. Traditional encryption schemes do not support operations over the ciphertexts and thus anonymizing encrypted datasets is not feasible with current approaches. This paper explores the feasibility and depth of implementing a privacy-preserving data publishing workflow over encrypted datasets using homomorphic encryption. We demonstrate how we can achieve uniqueness discovery, data masking, differential privacy and k-anonymity over encrypted data while requiring zero knowledge about the original values. We prove that the security protocols followed by our approach provide strong guarantees against inference attacks. Finally, we experimentally demonstrate the performance of our data publishing workflow components.
Submitted 10 August, 2021;
originally announced August 2021.
-
Secure Random Sampling in Differential Privacy
Authors:
Naoise Holohan,
Stefano Braghin
Abstract:
Differential privacy is among the most prominent techniques for preserving privacy of sensitive data, owing to its robust mathematical guarantees and general applicability to a vast array of computations on data, including statistical analysis and machine learning. Previous work demonstrated that concrete implementations of differential privacy mechanisms are vulnerable to statistical attacks. This vulnerability is caused by the approximation of real values to floating point numbers. This paper presents a practical solution to the finite-precision floating point vulnerability, where the inverse transform sampling of the Laplace distribution can itself be inverted, thus enabling an attack where the original value can be retrieved with non-negligible advantage. The proposed solution has the advantages of being generalisable to any infinitely divisible probability distribution, and of simple implementation in modern architectures. Finally, the solution has been designed to make side-channel attacks infeasible, because brute-force attacks are inherently exponential in the size of the domain.
Submitted 24 November, 2021; v1 submitted 21 July, 2021;
originally announced July 2021.
-
IBM Federated Learning: an Enterprise Framework White Paper V0.1
Authors:
Heiko Ludwig,
Nathalie Baracaldo,
Gegi Thomas,
Yi Zhou,
Ali Anwar,
Shashank Rajamoni,
Yuya Ong,
Jayaram Radhakrishnan,
Ashish Verma,
Mathieu Sinn,
Mark Purcell,
Ambrish Rawat,
Tran Minh,
Naoise Holohan,
Supriyo Chakraborty,
Shalisha Whitherspoon,
Dean Steuer,
Laura Wynter,
Hifaz Hassan,
Sean Laguna,
Mikhail Yurochkin,
Mayank Agarwal,
Ebube Chuba,
Annie Abay
Abstract:
Federated Learning (FL) is an approach to conduct machine learning without centralizing training data in a single place, for reasons of privacy, confidentiality or data volume. However, solving federated machine learning problems raises issues above and beyond those of centralized machine learning. These issues include setting up communication infrastructure between parties, coordinating the learning process, integrating party results, understanding the characteristics of the training data sets of different participating parties, handling data heterogeneity, and operating with the absence of a verification data set.
IBM Federated Learning provides infrastructure and coordination for federated learning. Data scientists can design and run federated learning jobs based on existing, centralized machine learning models and can provide high-level instructions on how to run the federation. The framework applies to both Deep Neural Networks as well as "traditional" approaches for the most common machine learning libraries. IBM Federated Learning enables data scientists to expand their scope from centralized to federated machine learning, minimizing the learning curve at the outset while also providing the flexibility to deploy to different compute environments and design custom fusion algorithms.
Submitted 22 July, 2020;
originally announced July 2020.
-
Diffprivlib: The IBM Differential Privacy Library
Authors:
Naoise Holohan,
Stefano Braghin,
Pól Mac Aonghusa,
Killian Levacher
Abstract:
Since its conception in 2006, differential privacy has emerged as the de-facto standard in data privacy, owing to its robust mathematical guarantees, generalised applicability and rich body of literature. Over the years, researchers have studied differential privacy and its applicability to an ever-widening field of topics. Mechanisms have been created to optimise the process of achieving differential privacy, for various data types and scenarios. Until this work however, all previous work on differential privacy has been conducted on an ad-hoc basis, without a single, unifying codebase to implement results.
In this work, we present the IBM Differential Privacy Library, a general purpose, open source library for investigating, experimenting and developing differential privacy applications in the Python programming language. The library includes a host of mechanisms, the building blocks of differential privacy, alongside a number of applications to machine learning and other data analytics tasks. Simplicity and accessibility have been prioritised in developing the library, making it suitable to a wide audience of users, from those using the library for their first investigations in data privacy, to the privacy experts looking to contribute their own models and mechanisms for others to use.
Submitted 4 July, 2019;
originally announced July 2019.
-
AnonTokens: tracing re-identification attacks through decoy records
Authors:
Spiros Antonatos,
Stefano Braghin,
Naoise Holohan,
Pol MacAonghusa
Abstract:
Privacy is of the utmost concern when it comes to releasing data to third parties. Data owners rely on anonymization approaches to safeguard the released datasets against re-identification attacks. However, even with strict anonymization in place, re-identification attacks are still a possibility and in many cases a reality. Prior art has focused on providing better anonymization algorithms with minimal loss of information and how to prevent data disclosure attacks. Our approach tries to tackle the issue of tracing re-identification attacks based on the concept of honeytokens, decoy or "bait" records with the goal to lure malicious users. While the concept of honeytokens has been widely used in the security domain, this is the first approach to apply the concept on the data privacy domain. Records with high re-identification risk, called AnonTokens, are inserted into anonymized datasets. This work demonstrates the feasibility, detectability and usability of AnonTokens and provides promising results for data owners who want to apply our approach to real use cases. We evaluated our concept with real large-scale population datasets. The results show that the introduction of decoy tokens is feasible without significant impact on the released dataset.
Submitted 24 June, 2019;
originally announced June 2019.
-
The Bounded Laplace Mechanism in Differential Privacy
Authors:
Naoise Holohan,
Spiros Antonatos,
Stefano Braghin,
Pól Mac Aonghusa
Abstract:
The Laplace mechanism is the workhorse of differential privacy, applied to many instances where numerical data is processed. However, the Laplace mechanism can return semantically impossible values, such as negative counts, due to its infinite support. There are two popular solutions to this: (i) bounding/capping the output values and (ii) bounding the mechanism support. In this paper, we show that bounding the mechanism support, while using the parameters of the pure Laplace mechanism, does not typically preserve differential privacy. We also present a robust method to compute the optimal mechanism parameters to achieve differential privacy in such a setting.
Submitted 30 August, 2018;
originally announced August 2018.
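The two options named in this abstract can be contrasted with a minimal sketch, assuming a simple rejection-sampling reading of "bounding the mechanism support". The function names are hypothetical, and `scale` is left as a free parameter: as the abstract notes, reusing the pure Laplace mechanism's scale in option (ii) does not typically preserve differential privacy, and the paper's method for computing suitable parameters is not reproduced here.

```python
import random


def laplace(scale, rng):
    """Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def clamped_laplace(value, scale, lower, upper, rng):
    """Option (i): add unbounded noise, then cap the output to [lower, upper]."""
    noisy = value + laplace(scale, rng)
    return min(max(noisy, lower), upper)


def bounded_support_laplace(value, scale, lower, upper, rng):
    """Option (ii): restrict the support by resampling until the output is in range.

    The scale passed in is NOT guaranteed to give differential privacy;
    choosing it correctly is the subject of the paper.
    """
    while True:
        noisy = value + laplace(scale, rng)
        if lower <= noisy <= upper:
            return noisy


rng = random.Random(0)
print(clamped_laplace(1.0, 2.0, 0.0, 10.0, rng))
print(bounded_support_laplace(1.0, 2.0, 0.0, 10.0, rng))
```

Clamping piles probability mass onto the two endpoints, whereas resampling renormalises the density inside the interval, which is why the two options behave differently.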
-
($k$,$ε$)-Anonymity: $k$-Anonymity with $ε$-Differential Privacy
Authors:
Naoise Holohan,
Spiros Antonatos,
Stefano Braghin,
Pól Mac Aonghusa
Abstract:
The explosion in volume and variety of data offers enormous potential for research and commercial use. Increased availability of personal data is of particular interest in enabling highly customised services tuned to individual needs. Preserving the privacy of individuals against reidentification attacks in this fast-moving ecosystem poses significant challenges for a one-size-fits-all approach to anonymisation.
In this paper we present ($k$,$ε$)-anonymisation, an approach that combines the $k$-anonymisation and $ε$-differential privacy models into a single coherent framework, providing privacy guarantees at least as strong as those offered by the individual models. Linking risks of less than 5\% are observed in experimental results, even with modest values of $k$ and $ε$.
Our approach is shown to address well-known limitations of $k$-anonymity and $ε$-differential privacy and is validated in an extensive experimental campaign using openly available datasets.
Submitted 4 October, 2017;
originally announced October 2017.
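For readers unfamiliar with the $k$-anonymity half of this framework, the following is a minimal sketch of the standard check that every combination of quasi-identifier values appears at least $k$ times. It does not implement the combined ($k$,$ε$) approach from the paper; the function names, column names and sample rows are all invented for illustration.

```python
from collections import Counter


def smallest_group(records, quasi_identifiers):
    """Size of the smallest equivalence class over the quasi-identifier columns.

    Returns 0 for an empty table (a choice made for this sketch).
    """
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values(), default=0)


def is_k_anonymous(records, quasi_identifiers, k):
    """True when every quasi-identifier combination is shared by at least k rows."""
    return smallest_group(records, quasi_identifiers) >= k


# Invented, already-generalised sample rows.
rows = [
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "021**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "021**", "diagnosis": "flu"},
]

print(is_k_anonymous(rows, ["age", "zip"], 2))      # the 40-49 row is alone in its group
print(is_k_anonymous(rows[:2], ["age", "zip"], 2))  # both rows share one group
```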
-
Optimal Differentially Private Mechanisms for Randomised Response
Authors:
Naoise Holohan,
Douglas J. Leith,
Oliver Mason
Abstract:
We examine a generalised Randomised Response (RR) technique in the context of differential privacy and examine the optimality of such mechanisms. Strict and relaxed differential privacy are considered for binary outputs. By examining the error of a statistical estimator, we present closed solutions for the optimal mechanism(s) in both cases. The optimal mechanism is also given for the specific case of the original RR technique as introduced by Warner in 1965.
Submitted 16 December, 2016;
originally announced December 2016.
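A minimal sketch of the original Warner-style randomised response mentioned in this abstract may help fix ideas: each respondent reports the truth with probability `p` and the opposite otherwise, and the analyst debiases the observed proportion. This is the textbook binary scheme only, not the generalised or optimal mechanisms derived in the paper; the function names are hypothetical, and `p` is assumed to lie strictly between 0.5 and 1.

```python
import math
import random


def warner_response(truth, p, rng):
    """Report the true bit with probability p, otherwise report its negation."""
    return truth if rng.random() < p else not truth


def estimate_proportion(responses, p):
    """Unbiased estimate of the true 'yes' proportion from randomised answers.

    If pi is the true proportion, the expected observed proportion is
    lam = p * pi + (1 - p) * (1 - pi), so pi = (lam - (1 - p)) / (2p - 1).
    """
    lam = sum(responses) / len(responses)
    return (lam - (1 - p)) / (2 * p - 1)


def epsilon_of(p):
    """Privacy level of this scheme under strict differential privacy."""
    return math.log(p / (1 - p))


rng = random.Random(1)
truths = [True] * 300 + [False] * 700
answers = [warner_response(t, 0.75, rng) for t in truths]
print(estimate_proportion(answers, 0.75))  # close to the true proportion 0.3
print(epsilon_of(0.75))                    # ln(3)
```

Moving `p` towards 0.5 strengthens privacy (smaller epsilon) but inflates the variance of the estimator, which is the trade-off that the optimality analysis addresses.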
-
Extreme Points of the Local Differential Privacy Polytope
Authors:
Naoise Holohan,
Douglas J. Leith,
Oliver Mason
Abstract:
We study the convex polytope of n x n stochastic matrices that define locally differentially private mechanisms. We first present invariance properties of the polytope and results reducing the number of constraints needed to define it. Our main results concern the extreme points of the polytope. In particular, we completely characterise these for matrices with 1, 2 or n non-zero columns.
Submitted 18 May, 2016;
originally announced May 2016.
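To make the object of study concrete, here is a hedged sketch of a membership test for such a polytope. It assumes the common convention that rows index inputs, columns index outputs, each row sums to 1, and the strict ε-local-DP constraint is P[i][j] <= exp(ε) * P[k][j] for every column j and every pair of rows i, k. The paper's exact conventions may differ, and the function name and tolerances are invented for illustration.

```python
import math


def is_local_dp(matrix, epsilon, tol=1e-12):
    """Check that a row-stochastic matrix satisfies the epsilon-local-DP constraints."""
    n_cols = len(matrix[0])
    # Every row must be a probability distribution over the outputs.
    for row in matrix:
        if len(row) != n_cols or any(x < 0 for x in row):
            return False
        if abs(sum(row) - 1.0) > 1e-9:
            return False
    bound = math.exp(epsilon)
    # Within each column, the largest entry may exceed the smallest by at most
    # a factor exp(epsilon); this is equivalent to checking all row pairs.
    for j in range(n_cols):
        column = [row[j] for row in matrix]
        if max(column) > bound * min(column) + tol:
            return False
    return True


# Binary randomised response with truth probability 0.75 needs epsilon >= ln(3).
rr = [[0.75, 0.25], [0.25, 0.75]]
print(is_local_dp(rr, 1.2))
print(is_local_dp(rr, 1.0))
```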
-
Differentially Private Response Mechanisms on Categorical Data
Authors:
Naoise Holohan,
Doug Leith,
Oliver Mason
Abstract:
We study mechanisms for differential privacy on finite datasets. By deriving \emph{sufficient sets} for differential privacy we obtain necessary and sufficient conditions for differential privacy, a tight lower bound on the maximal expected error of a discrete mechanism and a characterisation of the optimal mechanism which minimises the maximal expected error within the class of mechanisms considered.
Submitted 27 May, 2015;
originally announced May 2015.
-
Differential Privacy in Metric Spaces: Numerical, Categorical and Functional Data Under the One Roof
Authors:
Naoise Holohan,
Douglas Leith,
Oliver Mason
Abstract:
We study Differential Privacy in the abstract setting of Probability on metric spaces. Numerical, categorical and functional data can be handled in a uniform manner in this setting. We demonstrate how mechanisms based on data sanitisation and those that rely on adding noise to query responses fit within this framework. We prove that once the sanitisation is differentially private, then so is the query response for any query. We show how to construct sanitisations for high-dimensional databases using simple 1-dimensional mechanisms. We also provide lower bounds on the expected error for differentially private sanitisations in the general metric space setting. Finally, we consider the question of sufficient sets for differential privacy and show that for relaxed differential privacy, any algebra generating the Borel $σ$-algebra is a sufficient set for relaxed differential privacy.
Submitted 25 February, 2014;
originally announced February 2014.