-
Privacy-Preserving Federated Learning over Vertically and Horizontally Partitioned Data for Financial Anomaly Detection
Authors:
Swanand Ravindra Kadhe,
Heiko Ludwig,
Nathalie Baracaldo,
Alan King,
Yi Zhou,
Keith Houck,
Ambrish Rawat,
Mark Purcell,
Naoise Holohan,
Mikio Takeuchi,
Ryo Kawahara,
Nir Drucker,
Hayim Shaul,
Eyal Kushnir,
Omri Soceanu
Abstract:
The effective detection of evidence of financial anomalies requires collaboration among multiple entities who own a diverse set of data, such as a payment network system (PNS) and its partner banks. Trust among these financial institutions is limited by regulation and competition. Federated learning (FL) enables entities to collaboratively train a model when data is either vertically or horizontal…
▽ More
The effective detection of evidence of financial anomalies requires collaboration among multiple entities who own a diverse set of data, such as a payment network system (PNS) and its partner banks. Trust among these financial institutions is limited by regulation and competition. Federated learning (FL) enables entities to collaboratively train a model when data is either vertically or horizontally partitioned across the entities. However, in real-world financial anomaly detection scenarios, the data is partitioned both vertically and horizontally and hence it is not possible to use existing FL approaches in a plug-and-play manner.
Our novel solution, PV4FAD, combines fully homomorphic encryption (HE), secure multi-party computation (SMPC), differential privacy (DP), and randomization techniques to balance privacy and accuracy during training and to prevent inference threats at model deployment time. Our solution provides input privacy through HE and SMPC, and output privacy against inference time attacks through DP. Specifically, we show that, in the honest-but-curious threat model, banks do not learn any sensitive features about PNS transactions, and the PNS does not learn any information about the banks' dataset but only learns prediction labels. We also develop and analyze a DP mechanism to protect output privacy during inference. Our solution generates high-utility models by significantly reducing the per-bank noise level while satisfying distributed DP. To ensure high accuracy, our approach produces an ensemble model, in particular, a random forest. This enables us to take advantage of the well-known properties of ensembles to reduce variance and increase accuracy. Our solution won second prize in the first phase of the U.S. Privacy Enhancing Technologies (PETs) Prize Challenge.
△ Less
Submitted 30 October, 2023;
originally announced October 2023.
-
A Tutorial Introduction to Lattice-based Cryptography and Homomorphic Encryption
Authors:
Yang Li,
Kee Siong Ng,
Michael Purcell
Abstract:
Why study Lattice-based Cryptography? There are a few ways to answer this question. 1. It is useful to have cryptosystems that are based on a variety of hard computational problems so the different cryptosystems are not all vulnerable in the same way. 2. The computational aspects of lattice-based cryptosystem are usually simple to understand and fairly easy to implement in practice. 3. Lattice-bas…
▽ More
Why study Lattice-based Cryptography? There are a few ways to answer this question. 1. It is useful to have cryptosystems that are based on a variety of hard computational problems so the different cryptosystems are not all vulnerable in the same way. 2. The computational aspects of lattice-based cryptosystem are usually simple to understand and fairly easy to implement in practice. 3. Lattice-based cryptosystems have lower encryption/decryption computational complexities compared to popular cryptosystems that are based on the integer factorisation or the discrete logarithm problems. 4. Lattice-based cryptosystems enjoy strong worst-case hardness security proofs based on approximate versions of known NP-hard lattice problems. 5. Lattice-based cryptosystems are believed to be good candidates for post-quantum cryptography, since there are currently no known quantum algorithms for solving lattice problems that perform significantly better than the best-known classical (non-quantum) algorithms, unlike for integer factorisation and (elliptic curve) discrete logarithm problems. 6. Last but not least, interesting structures in lattice problems have led to significant advances in Homomorphic Encryption, a new research area with wide-ranging applications.
△ Less
Submitted 28 September, 2022; v1 submitted 17 August, 2022;
originally announced August 2022.
-
Adaptive Aggregation For Federated Learning
Authors:
K. R. Jayaram,
Vinod Muthusamy,
Gegi Thomas,
Ashish Verma,
Mark Purcell
Abstract:
Advances in federated learning (FL) algorithms,along with technologies like differential privacy and homomorphic encryption, have led to FL being increasingly adopted and used in many application domains. This increasing adoption has led to rapid growth in the number, size (number of participants/parties) and diversity (intermittent vs. active parties) of FL jobs. Many existing FL systems, based o…
▽ More
Advances in federated learning (FL) algorithms,along with technologies like differential privacy and homomorphic encryption, have led to FL being increasingly adopted and used in many application domains. This increasing adoption has led to rapid growth in the number, size (number of participants/parties) and diversity (intermittent vs. active parties) of FL jobs. Many existing FL systems, based on centralized (often single) model aggregators are unable to scale to handle large FL jobs and adapt to parties' behavior.
In this paper, we present a new scalable and adaptive architecture for FL aggregation. First, we demonstrate how traditional tree overlay based aggregation techniques (from P2P, publish-subscribe and stream processing research) can help FL aggregation scale, but are ineffective from a resource utilization and cost standpoint. Next, we present the design and implementation of AdaFed, which uses serverless/cloud functions to adaptively scale aggregation in a resource efficient and fault tolerant manner. We describe how AdaFed enables FL aggregation to be dynamically deployed only when necessary, elastically scaled to handle participant joins/leaves and is fault tolerant with minimal effort required on the (aggregation) programmer side. We also demonstrate that our prototype based on Ray scales to thousands of participants, and is able to achieve a >90% reduction in resource requirements and cost, with minimal impact on aggregation latency.
△ Less
Submitted 6 November, 2022; v1 submitted 22 March, 2022;
originally announced March 2022.
-
Towards an Accountable and Reproducible Federated Learning: A FactSheets Approach
Authors:
Nathalie Baracaldo,
Ali Anwar,
Mark Purcell,
Ambrish Rawat,
Mathieu Sinn,
Bashar Altakrouri,
Dian Balta,
Mahdi Sellami,
Peter Kuhn,
Ulrich Schopp,
Matthias Buchinger
Abstract:
Federated Learning (FL) is a novel paradigm for the shared training of models based on decentralized and private data. With respect to ethical guidelines, FL is promising regarding privacy, but needs to excel vis-à-vis transparency and trustworthiness. In particular, FL has to address the accountability of the parties involved and their adherence to rules, law and principles. We introduce AF^2 Fra…
▽ More
Federated Learning (FL) is a novel paradigm for the shared training of models based on decentralized and private data. With respect to ethical guidelines, FL is promising regarding privacy, but needs to excel vis-à-vis transparency and trustworthiness. In particular, FL has to address the accountability of the parties involved and their adherence to rules, law and principles. We introduce AF^2 Framework, where we instrument FL with accountability by fusing verifiable claims with tamper-evident facts, into reproducible arguments. We build on AI FactSheets for instilling transparency and trustworthiness into the AI lifecycle and expand it to incorporate dynamic and nested facts, as well as complex model compositions in FL. Based on our approach, an auditor can validate, reproduce and certify a FL process. This can be directly applied in practice to address the challenges of AI engineering and ethics.
△ Less
Submitted 24 February, 2022;
originally announced February 2022.
-
Private Graph Data Release: A Survey
Authors:
Yang Li,
Michael Purcell,
Thierry Rakotoarivelo,
David Smith,
Thilina Ranbaduge,
Kee Siong Ng
Abstract:
The application of graph analytics to various domains has yielded tremendous societal and economical benefits in recent years. However, the increasingly widespread adoption of graph analytics comes with a commensurate increase in the need to protect private information in graph data, especially in light of the many privacy breaches in real-world graph data that was supposed to preserve sensitive i…
▽ More
The application of graph analytics to various domains has yielded tremendous societal and economical benefits in recent years. However, the increasingly widespread adoption of graph analytics comes with a commensurate increase in the need to protect private information in graph data, especially in light of the many privacy breaches in real-world graph data that was supposed to preserve sensitive information. This paper provides a comprehensive survey of private graph data release algorithms that seek to achieve the fine balance between privacy and utility, with a specific focus on provably private mechanisms. Many of these mechanisms are natural extensions of the Differential Privacy framework to graph data, but we also investigate more general privacy formulations like Pufferfish Privacy that address some of the limitations of Differential Privacy. We also provide a wide-ranging survey of the applications of private graph data release mechanisms to social networks, finance, supply chain, and health care. This survey paper and the taxonomy it provides should benefit practitioners and researchers alike in the increasingly important area of private analytics and data release.
△ Less
Submitted 4 June, 2022; v1 submitted 9 July, 2021;
originally announced July 2021.
-
Knowledge- and Data-driven Services for Energy Systems using Graph Neural Networks
Authors:
Francesco Fusco,
Bradley Eck,
Robert Gormally,
Mark Purcell,
Seshu Tirupathi
Abstract:
The transition away from carbon-based energy sources poses several challenges for the operation of electricity distribution systems. Increasing shares of distributed energy resources (e.g. renewable energy generators, electric vehicles) and internet-connected sensing and control devices (e.g. smart heating and cooling) require new tools to support accurate, datadriven decision making. Modelling th…
▽ More
The transition away from carbon-based energy sources poses several challenges for the operation of electricity distribution systems. Increasing shares of distributed energy resources (e.g. renewable energy generators, electric vehicles) and internet-connected sensing and control devices (e.g. smart heating and cooling) require new tools to support accurate, datadriven decision making. Modelling the effect of such growing complexity in the electrical grid is possible in principle using state-of-the-art power-power flow models. In practice, the detailed information needed for these physical simulations may be unknown or prohibitively expensive to obtain. Hence, datadriven approaches to power systems modelling, including feedforward neural networks and auto-encoders, have been studied to leverage the increasing availability of sensor data, but have seen limited practical adoption due to lack of transparency and inefficiencies on large-scale problems. Our work addresses this gap by proposing a data- and knowledge-driven probabilistic graphical model for energy systems based on the framework of graph neural networks (GNNs). The model can explicitly factor in domain knowledge, in the form of grid topology or physics constraints, thus resulting in sparser architectures and much smaller parameters dimensionality when compared with traditional machine-learning models with similar accuracy. Results obtained from a real-world smart-grid demonstration project show how the GNN was used to inform grid congestion predictions and market bidding services for a distribution system operator participating in an energy flexibility market.
△ Less
Submitted 12 March, 2021;
originally announced March 2021.
-
IBM Federated Learning: an Enterprise Framework White Paper V0.1
Authors:
Heiko Ludwig,
Nathalie Baracaldo,
Gegi Thomas,
Yi Zhou,
Ali Anwar,
Shashank Rajamoni,
Yuya Ong,
Jayaram Radhakrishnan,
Ashish Verma,
Mathieu Sinn,
Mark Purcell,
Ambrish Rawat,
Tran Minh,
Naoise Holohan,
Supriyo Chakraborty,
Shalisha Whitherspoon,
Dean Steuer,
Laura Wynter,
Hifaz Hassan,
Sean Laguna,
Mikhail Yurochkin,
Mayank Agarwal,
Ebube Chuba,
Annie Abay
Abstract:
Federated Learning (FL) is an approach to conduct machine learning without centralizing training data in a single place, for reasons of privacy, confidentiality or data volume. However, solving federated machine learning problems raises issues above and beyond those of centralized machine learning. These issues include setting up communication infrastructure between parties, coordinating the learn…
▽ More
Federated Learning (FL) is an approach to conduct machine learning without centralizing training data in a single place, for reasons of privacy, confidentiality or data volume. However, solving federated machine learning problems raises issues above and beyond those of centralized machine learning. These issues include setting up communication infrastructure between parties, coordinating the learning process, integrating party results, understanding the characteristics of the training data sets of different participating parties, handling data heterogeneity, and operating with the absence of a verification data set.
IBM Federated Learning provides infrastructure and coordination for federated learning. Data scientists can design and run federated learning jobs based on existing, centralized machine learning models and can provide high-level instructions on how to run the federation. The framework applies to both Deep Neural Networks as well as ``traditional'' approaches for the most common machine learning libraries. {\proj} enables data scientists to expand their scope from centralized to federated machine learning, minimizing the learning curve at the outset while also providing the flexibility to deploy to different compute environments and design custom fusion algorithms.
△ Less
Submitted 22 July, 2020;
originally announced July 2020.
-
Scalable Deployment of AI Time-series Models for IoT
Authors:
Bradley Eck,
Francesco Fusco,
Robert Gormally,
Mark Purcell,
Seshu Tirupathi
Abstract:
IBM Research Castor, a cloud-native system for managing and deploying large numbers of AI time-series models in IoT applications, is described. Modelling code templates, in Python and R, following a typical machine-learning workflow are supported. A knowledge-based approach to managing model and time-series data allows the use of general semantic concepts for expressing feature engineering tasks.…
▽ More
IBM Research Castor, a cloud-native system for managing and deploying large numbers of AI time-series models in IoT applications, is described. Modelling code templates, in Python and R, following a typical machine-learning workflow are supported. A knowledge-based approach to managing model and time-series data allows the use of general semantic concepts for expressing feature engineering tasks. Model templates can be programmatically deployed against specific instances of semantic concepts, thus supporting model reuse and automated replication as the IoT application grows. Deployed models are automatically executed in parallel leveraging a serverless cloud computing framework. The complete history of trained model versions and rolling-horizon predictions is persisted, thus enabling full model lineage and traceability. Results from deployments in real-world smart-grid live forecasting applications are reported. Scalability of executing up to tens of thousands of AI modelling tasks is also evaluated.
△ Less
Submitted 24 March, 2020;
originally announced March 2020.