Search | arXiv e-print repository

Client-specific Property Inference against Secure Aggregation in Federated Learning

Authors: Raouf Kerkouche, Gergely Ács, Mario Fritz

Abstract: Federated learning has become a widely used paradigm for collaboratively training a common model among different participants with the help of a central server that coordinates the training. Although only the model parameters or other model updates are exchanged during the federated training instead of the participant's data, many attacks have shown that it is still possible to infer sensitive inf… ▽ More Federated learning has become a widely used paradigm for collaboratively training a common model among different participants with the help of a central server that coordinates the training. Although only the model parameters or other model updates are exchanged during the federated training instead of the participant's data, many attacks have shown that it is still possible to infer sensitive information such as membership, property, or outright reconstruction of participant data. Although differential privacy is considered an effective solution to protect against privacy attacks, it is also criticized for its negative effect on utility. Another possible defense is to use secure aggregation which allows the server to only access the aggregated update instead of each individual one, and it is often more appealing because it does not degrade model quality. However, combining only the aggregated updates, which are generated by a different composition of clients in every round, may still allow the inference of some client-specific information. In this paper, we show that simple linear models can effectively capture client-specific properties only from the aggregated model updates due to the linearity of aggregation. We formulate an optimization problem across different rounds in order to infer a tested property of every client from the output of the linear models, for example, whether they have a specific sample in their training data (membership inference) or whether they misbehave and attempt to degrade the performance of the common model by poisoning attacks. Our reconstruction technique is completely passive and undetectable. We demonstrate the efficacy of our approach on several scenarios which shows that secure aggregation provides very limited privacy guarantees in practice. The source code will be released upon publication. △ Less

Submitted 27 October, 2023; v1 submitted 7 March, 2023; originally announced March 2023.

Comments: Workshop on Privacy in the Electronic Society (WPES'23), held in conjunction with CCS'23

arXiv:2210.08871 [pdf, other]

Industry-Scale Orchestrated Federated Learning for Drug Discovery

Authors: Martijn Oldenhof, Gergely Ács, Balázs Pejó, Ansgar Schuffenhauer, Nicholas Holway, Noé Sturm, Arne Dieckmann, Oliver Fortmeier, Eric Boniface, Clément Mayer, Arnaud Gohier, Peter Schmidtke, Ritsuya Niwayama, Dieter Kopecky, Lewis Mervin, Prakash Chandra Rathi, Lukas Friedrich, András Formanek, Peter Antal, Jordon Rahaman, Adam Zalewski, Wouter Heyndrickx, Ezron Oluoch, Manuel Stößel, Michal Vančo , et al. (22 additional authors not shown)

Abstract: To apply federated learning to drug discovery we developed a novel platform in the context of European Innovative Medicines Initiative (IMI) project MELLODDY (grant n°831472), which was comprised of 10 pharmaceutical companies, academic research labs, large industrial companies and startups. The MELLODDY platform was the first industry-scale platform to enable the creation of a global federated mo… ▽ More To apply federated learning to drug discovery we developed a novel platform in the context of European Innovative Medicines Initiative (IMI) project MELLODDY (grant n°831472), which was comprised of 10 pharmaceutical companies, academic research labs, large industrial companies and startups. The MELLODDY platform was the first industry-scale platform to enable the creation of a global federated model for drug discovery without sharing the confidential data sets of the individual partners. The federated model was trained on the platform by aggregating the gradients of all contributing partners in a cryptographic, secure way following each training iteration. The platform was deployed on an Amazon Web Services (AWS) multi-account architecture running Kubernetes clusters in private subnets. Organisationally, the roles of the different partners were codified as different rights and permissions on the platform and administrated in a decentralized way. The MELLODDY platform generated new scientific discoveries which are described in a companion paper. △ Less

Submitted 12 December, 2022; v1 submitted 17 October, 2022; originally announced October 2022.

Comments: 9 pages, 4 figures, to appear in AAAI-23 ([IAAI-23 track] Deployed Highly Innovative Applications of AI)

arXiv:2205.06506 [pdf, other]

Collaborative Drug Discovery: Inference-level Data Protection Perspective

Authors: Balazs Pejo, Mina Remeli, Adam Arany, Mathieu Galtier, Gergely Acs

Abstract: Pharmaceutical industry can better leverage its data assets to virtualize drug discovery through a collaborative machine learning platform. On the other hand, there are non-negligible risks stemming from the unintended leakage of participants' training data, hence, it is essential for such a platform to be secure and privacy-preserving. This paper describes a privacy risk assessment for collaborat… ▽ More Pharmaceutical industry can better leverage its data assets to virtualize drug discovery through a collaborative machine learning platform. On the other hand, there are non-negligible risks stemming from the unintended leakage of participants' training data, hence, it is essential for such a platform to be secure and privacy-preserving. This paper describes a privacy risk assessment for collaborative modeling in the preclinical phase of drug discovery to accelerate the selection of promising drug candidates. After a short taxonomy of state-of-the-art inference attacks we adopt and customize several to the underlying scenario. Finally we describe and experiments with a handful of relevant privacy protection techniques to mitigate such attacks. △ Less

Submitted 9 June, 2022; v1 submitted 13 May, 2022; originally announced May 2022.

arXiv:2103.00342 [pdf, other]

Constrained Differentially Private Federated Learning for Low-bandwidth Devices

Authors: Raouf Kerkouche, Gergely Ács, Claude Castelluccia, Pierre Genevès

Abstract: Federated learning becomes a prominent approach when different entities want to learn collaboratively a common model without sharing their training data. However, Federated learning has two main drawbacks. First, it is quite bandwidth inefficient as it involves a lot of message exchanges between the aggregating server and the participating entities. This bandwidth and corresponding processing cost… ▽ More Federated learning becomes a prominent approach when different entities want to learn collaboratively a common model without sharing their training data. However, Federated learning has two main drawbacks. First, it is quite bandwidth inefficient as it involves a lot of message exchanges between the aggregating server and the participating entities. This bandwidth and corresponding processing costs could be prohibitive if the participating entities are, for example, mobile devices. Furthermore, although federated learning improves privacy by not sharing data, recent attacks have shown that it still leaks information about the training data. This paper presents a novel privacy-preserving federated learning scheme. The proposed scheme provides theoretical privacy guarantees, as it is based on Differential Privacy. Furthermore, it optimizes the model accuracy by constraining the model learning phase on few selected weights. Finally, as shown experimentally, it reduces the upstream and downstream bandwidth by up to 99.9% compared to standard federated learning, making it practical for mobile systems. △ Less

Submitted 27 February, 2021; originally announced March 2021.

Comments: arXiv admin note: text overlap with arXiv:2011.05578

arXiv:2011.05578 [pdf, ps, other]

Compression Boosts Differentially Private Federated Learning

Authors: Raouf Kerkouche, Gergely Ács, Claude Castelluccia, Pierre Genevès

Abstract: Federated Learning allows distributed entities to train a common model collaboratively without sharing their own data. Although it prevents data collection and aggregation by exchanging only parameter updates, it remains vulnerable to various inference and reconstruction attacks where a malicious entity can learn private information about the participants' training data from the captured gradients… ▽ More Federated Learning allows distributed entities to train a common model collaboratively without sharing their own data. Although it prevents data collection and aggregation by exchanging only parameter updates, it remains vulnerable to various inference and reconstruction attacks where a malicious entity can learn private information about the participants' training data from the captured gradients. Differential Privacy is used to obtain theoretically sound privacy guarantees against such inference attacks by noising the exchanged update vectors. However, the added noise is proportional to the model size which can be very large with modern neural networks. This can result in poor model quality. In this paper, compressive sensing is used to reduce the model size and hence increase model quality without sacrificing privacy. We show experimentally, using 2 datasets, that our privacy-preserving proposal can reduce the communication costs by up to 95% with only a negligible performance penalty compared to traditional non-private federated learning schemes. △ Less

Submitted 10 November, 2020; originally announced November 2020.

Comments: arXiv admin note: text overlap with arXiv:2010.07808

arXiv:2010.07808 [pdf, other]

Federated Learning in Adversarial Settings

Authors: Raouf Kerkouche, Gergely Ács, Claude Castelluccia

Abstract: Federated Learning enables entities to collaboratively learn a shared prediction model while kee** their training data locally. It prevents data collection and aggregation and, therefore, mitigates the associated privacy risks. However, it still remains vulnerable to various security attacks where malicious participants aim at degrading the generated model, inserting backdoors, or inferring othe… ▽ More Federated Learning enables entities to collaboratively learn a shared prediction model while kee** their training data locally. It prevents data collection and aggregation and, therefore, mitigates the associated privacy risks. However, it still remains vulnerable to various security attacks where malicious participants aim at degrading the generated model, inserting backdoors, or inferring other participants' training data. This paper presents a new federated learning scheme that provides different trade-offs between robustness, privacy, bandwidth efficiency, and model accuracy. Our scheme uses biased quantization of model updates and hence is bandwidth efficient. It is also robust against state-of-the-art backdoor as well as model degradation attacks even when a large proportion of the participant nodes are malicious. We propose a practical differentially private extension of this scheme which protects the whole dataset of participating entities. We show that this extension performs as efficiently as the non-private but robust scheme, even with stringent privacy requirements but are less robust against model degradation and backdoor attacks. This suggests a possible fundamental trade-off between Differential Privacy and robustness. △ Less

Submitted 15 October, 2020; originally announced October 2020.

arXiv:2008.01665 [pdf, other]

In Search of Lost Utility: Private Location Data

Authors: Szilvia Lestyán, Gergely Ács, Gergely Biczók

Abstract: The unavailability of training data is a permanent source of much frustration in research, especially when it is due to privacy concerns. This is particularly true for location data since previous techniques all suffer from the inherent sparseness and high dimensionality of location trajectories which render most techniques impractical, resulting in unrealistic traces and unscalable methods. Moreo… ▽ More The unavailability of training data is a permanent source of much frustration in research, especially when it is due to privacy concerns. This is particularly true for location data since previous techniques all suffer from the inherent sparseness and high dimensionality of location trajectories which render most techniques impractical, resulting in unrealistic traces and unscalable methods. Moreover, time information of location visits is usually dropped, or its resolution is drastically reduced. In this paper we present a novel technique for privately releasing a composite generative model and whole high-dimensional location datasets with detailed time information. To generate high-fidelity synthetic data, we leverage several peculiarities of vehicular mobility such as its language-like characteristics ("you should know a location by the company it keeps") or how humans plan their trips from one point to the other. We model the generator distribution of the dataset by first constructing a variational autoencoder to generate the source and destination locations, and the corresponding timing of trajectories. Next, we compute transition probabilities between locations with a feed forward network, and build a transition graph from the output of this model, which approximates the distribution of all paths between the source and destination (at a given time). Finally, a path is sampled from this distribution with a Markov Chain Monte Carlo method. The generated synthetic dataset is highly realistic, scalable, provides good utility and, nonetheless, provably private. We evaluate our model against two state-of-the-art methods and three real-life datasets demonstrating the benefits of our approach. △ Less

Submitted 14 March, 2022; v1 submitted 4 August, 2020; originally announced August 2020.

Comments: Accepted at PETS '22

arXiv:1911.09508 [pdf, other]

Automatic Driver Identification from In-Vehicle Network Logs

Authors: Mina Remeli, Szilvia Lestyan, Gergely Acs, Gergely Biczok

Abstract: Data generated by cars is growing at an unprecedented scale. As cars gradually become part of the Internet of Things (IoT) ecosystem, several stakeholders discover the value of in-vehicle network logs containing the measurements of the multitude of sensors deployed within the car. This wealth of data is also expected to be exploitable by third parties for the purpose of profiling drivers in order… ▽ More Data generated by cars is growing at an unprecedented scale. As cars gradually become part of the Internet of Things (IoT) ecosystem, several stakeholders discover the value of in-vehicle network logs containing the measurements of the multitude of sensors deployed within the car. This wealth of data is also expected to be exploitable by third parties for the purpose of profiling drivers in order to provide personalized, valueadded services. Although several prior works have successfully demonstrated the feasibility of driver re-identification using the in-vehicle network data captured on the vehicle's CAN (Controller Area Network) bus, they inferred the identity of the driver only from known sensor signals (such as the vehicle's speed, brake pedal position, steering wheel angle, etc.) extracted from the CAN messages. However, car manufacturers intentionally do not reveal exact signal location and semantics within CAN logs. We show that the inference of driver identity is possible even with off-the-shelf machine learning techniques without reverse-engineering the CAN protocol. We demonstrate our approach on a dataset of 33 drivers and show that a driver can be re-identified and distinguished from other drivers with an accuracy of 75-85%. △ Less

Submitted 25 October, 2019; originally announced November 2019.

arXiv:1902.08956 [pdf, other]

Extracting vehicle sensor signals from CAN logs for driver re-identification

Authors: Szilvia Lestyan, Gergely Acs, Gergely Biczok, Zsolt Szalay

Abstract: Data is the new oil for the car industry. Cars generate data about how they are used and who's behind the wheel which gives rise to a novel way of profiling individuals. Several prior works have successfully demonstrated the feasibility of driver re-identification using the in-vehicle network data captured on the vehicle's CAN (Controller Area Network) bus. However, all of them used signals (e.g.,… ▽ More Data is the new oil for the car industry. Cars generate data about how they are used and who's behind the wheel which gives rise to a novel way of profiling individuals. Several prior works have successfully demonstrated the feasibility of driver re-identification using the in-vehicle network data captured on the vehicle's CAN (Controller Area Network) bus. However, all of them used signals (e.g., velocity, brake pedal or accelerator position) that have already been extracted from the CAN log which is itself not a straightforward process. Indeed, car manufacturers intentionally do not reveal the exact signal location within CAN logs. Nevertheless, we show that signals can be efficiently extracted from CAN logs using machine learning techniques. We exploit that signals have several distinguishing statistical features which can be learnt and effectively used to identify them across different vehicles, that is, to quasi "reverse-engineer" the CAN protocol. We also demonstrate that the extracted signals can be successfully used to re-identify individuals in a dataset of 33 drivers. Therefore, not revealing signal locations in CAN logs per se does not prevent them to be regarded as personal data of drivers. △ Less

Submitted 25 October, 2019; v1 submitted 24 February, 2019; originally announced February 2019.

Comments: 10 pages

arXiv:1709.04514 [pdf, other]

Differentially Private Mixture of Generative Neural Networks

Authors: Gergely Acs, Luca Melis, Claude Castelluccia, Emiliano De Cristofaro

Abstract: Generative models are used in a wide range of applications building on large amounts of contextually rich information. Due to possible privacy violations of the individuals whose data is used to train these models, however, publishing or sharing generative models is not always viable. In this paper, we present a novel technique for privately releasing generative models and entire high-dimensional… ▽ More Generative models are used in a wide range of applications building on large amounts of contextually rich information. Due to possible privacy violations of the individuals whose data is used to train these models, however, publishing or sharing generative models is not always viable. In this paper, we present a novel technique for privately releasing generative models and entire high-dimensional datasets produced by these models. We model the generator distribution of the training data with a mixture of $k$ generative neural networks. These are trained together and collectively learn the generator distribution of a dataset. Data is divided into $k$ clusters, using a novel differentially private kernel $k$-means, then each cluster is given to separate generative neural networks, such as Restricted Boltzmann Machines or Variational Autoencoders, which are trained only on their own cluster using differentially private gradient descent. We evaluate our approach using the MNIST dataset, as well as call detail records and transit datasets, showing that it produces realistic synthetic samples, which can also be used to accurately compute arbitrary number of counting queries. △ Less

Submitted 13 July, 2018; v1 submitted 13 September, 2017; originally announced September 2017.

Comments: A shorter version of this paper appeared at the 17th IEEE International Conference on Data Mining (ICDM 2017). This is the full version, published in IEEE Transactions on Knowledge and Data Engineering (TKDE)

arXiv:1605.08664 [pdf, other]

doi 10.1515/popets-2016-0051

Near-Optimal Fingerprinting with Constraints

Authors: Gabor Gyorgy Gulyas, Gergely Acs, Claude Castelluccia

Abstract: Several recent studies have demonstrated that people show large behavioural uniqueness. This has serious privacy implications as most individuals become increasingly re-identifiable in large datasets or can be tracked while they are browsing the web using only a couple of their attributes, called as their fingerprints. Often, the success of these attacks depend on explicit constraints on the numbe… ▽ More Several recent studies have demonstrated that people show large behavioural uniqueness. This has serious privacy implications as most individuals become increasingly re-identifiable in large datasets or can be tracked while they are browsing the web using only a couple of their attributes, called as their fingerprints. Often, the success of these attacks depend on explicit constraints on the number of attributes learnable about individuals, i.e., the size of their fingerprints. These constraints can be budget as well as technical constraints imposed by the data holder. For instance, Apple restricts the number of applications that can be called by another application on iOS in order to mitigate the potential privacy threats of leaking the list of installed applications on a device. In this work, we address the problem of identifying the attributes (e.g., smartphone applications) that can serve as a fingerprint of users given constraints on the size of the fingerprint. We give the best fingerprinting algorithms in general, and evaluate their effectiveness on several real-world datasets. Our results show that current privacy guards limiting the number of attributes that can be queried about individuals is insufficient to mitigate their potential privacy risks in many practical cases. △ Less

Submitted 3 June, 2016; v1 submitted 27 May, 2016; originally announced May 2016.

arXiv:1507.07851 [pdf, other]

On the Unicity of Smartphone Applications

Authors: Jagdish Prasad Achara, Gergely Acs, Claude Castelluccia

Abstract: Prior works have shown that the list of apps installed by a user reveal a lot about user interests and behavior. These works rely on the semantics of the installed apps and show that various user traits could be learnt automatically using off-the-shelf machine-learning techniques. In this work, we focus on the re-identifiability issue and thoroughly study the unicity of smartphone apps on a datase… ▽ More Prior works have shown that the list of apps installed by a user reveal a lot about user interests and behavior. These works rely on the semantics of the installed apps and show that various user traits could be learnt automatically using off-the-shelf machine-learning techniques. In this work, we focus on the re-identifiability issue and thoroughly study the unicity of smartphone apps on a dataset containing 54,893 Android users collected over a period of 7 months. Our study finds that any 4 apps installed by a user are enough (more than 95% times) for the re-identification of the user in our dataset. As the complete list of installed apps is unique for 99% of the users in our dataset, it can be easily used to track/profile the users by a service such as Twitter that has access to the whole list of installed apps of users. As our analyzed dataset is small as compared to the total population of Android users, we also study how unicity would vary with larger datasets. This work emphasizes the need of better privacy guards against collection, use and release of the list of installed apps. △ Less

Submitted 29 October, 2015; v1 submitted 28 July, 2015; originally announced July 2015.

Comments: 10 pages, 9 Figures, Appeared at ACM CCS Workshop on Privacy in Electronic Society (WPES) 2015

arXiv:1404.4533 [pdf, other]

Retargeting Without Tracking

Authors: Minh-Dung Tran, Gergely Acs, Claude Castelluccia

Abstract: Retargeting ads are increasingly prevalent on the Internet as their effectiveness has been shown to outperform conventional targeted ads. Retargeting ads are not only based on users' interests, but also on their intents, i.e. commercial products users have shown interest in. Existing retargeting systems heavily rely on tracking, as retargeting companies need to know not only the websites a user ha… ▽ More Retargeting ads are increasingly prevalent on the Internet as their effectiveness has been shown to outperform conventional targeted ads. Retargeting ads are not only based on users' interests, but also on their intents, i.e. commercial products users have shown interest in. Existing retargeting systems heavily rely on tracking, as retargeting companies need to know not only the websites a user has visited but also the exact products on these sites. They are therefore very intrusive, and privacy threatening. Furthermore, these schemes are still sub-optimal since tracking is partial, and they often deliver ads that are obsolete (because, for example, the targeted user has already bought the advertised product). This paper presents the first privacy-preserving retargeting ads system. In the proposed scheme, the retargeting algorithm is distributed between the user and the advertiser such that no systematic tracking is necessary, more control and transparency is provided to users, but still a lot of targeting flexibility is provided to advertisers. We show that our scheme, that relies on homomorphic encryption, can be efficiently implemented and trivially solves many problems of existing schemes, such as frequency cap** and ads freshness. △ Less

Submitted 17 April, 2014; originally announced April 2014.

arXiv:1201.2531 [pdf, ps, other]

DREAM: DiffeRentially privatE smArt Metering

Authors: Gergely Acs, Claude Castelluccia

Abstract: This paper presents a new privacy-preserving smart metering system. Our scheme is private under the differential privacy model and therefore provides strong and provable guarantees. With our scheme, an (electricity) supplier can periodically collect data from smart meters and derive aggregated statistics while learning only limited information about the activities of individual households. For exa… ▽ More This paper presents a new privacy-preserving smart metering system. Our scheme is private under the differential privacy model and therefore provides strong and provable guarantees. With our scheme, an (electricity) supplier can periodically collect data from smart meters and derive aggregated statistics while learning only limited information about the activities of individual households. For example, a supplier cannot tell from a user's trace when he watched TV or turned on heating. Our scheme is simple, efficient and practical. Processing cost is very limited: smart meters only have to add noise to their data and encrypt the results with an efficient stream cipher. △ Less

Submitted 12 January, 2012; originally announced January 2012.

Comments: Shorter version appeared on Information Hiding Conference 2011

Showing 1–14 of 14 results for author: Acs, G