Search | arXiv e-print repository

Noisy Neighbors: Efficient membership inference attacks against LLMs

Authors: Filippo Galli, Luca Melis, Tommaso Cucinotta

Abstract: The potential of transformer-based LLMs risks being hindered by privacy concerns due to their reliance on extensive datasets, possibly including sensitive information. Regulatory measures like GDPR and CCPA call for using robust auditing tools to address potential privacy issues, with Membership Inference Attacks (MIA) being the primary method for assessing LLMs' privacy risks. Differently from traditional MIA approaches, often requiring computationally intensive training of additional models, this paper introduces an efficient methodology that generates \textit{noisy neighbors} for a target sample by adding stochastic noise in the embedding space, requiring operating the target model in inference mode only. Our findings demonstrate that this approach closely matches the effectiveness of employing shadow models, showing its usability in practical privacy auditing scenarios.

Submitted 24 June, 2024; originally announced June 2024.
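
The noisy-neighbor idea in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `toy_loss` stands in for a target model queried in inference mode, the Gaussian noise scale `sigma` and the neighbor count are arbitrary, and the score (mean neighbor loss minus target loss) is one plausible statistic, not necessarily the decision rule used in the paper.

```python
import random
import statistics


def noisy_neighbor_score(embeddings, loss_fn, sigma=0.1, n_neighbors=32, seed=0):
    """Gap between the mean loss of noisy neighbors and the loss of the target.

    embeddings: list of token embedding vectors (lists of floats).
    loss_fn:    hypothetical callable mapping embeddings to a scalar loss;
                it only needs forward (inference) access to the model.
    """
    rng = random.Random(seed)
    target_loss = loss_fn(embeddings)
    neighbor_losses = []
    for _ in range(n_neighbors):
        # Perturb every embedding coordinate with independent Gaussian noise.
        noisy = [[x + rng.gauss(0.0, sigma) for x in vec] for vec in embeddings]
        neighbor_losses.append(loss_fn(noisy))
    return statistics.mean(neighbor_losses) - target_loss


def toy_loss(embeddings):
    # Toy stand-in for a model loss: squared distance from the origin.
    return sum(x * x for vec in embeddings for x in vec)


# A sample sitting exactly at the minimum of the toy loss.
member = [[0.0, 0.0], [0.0, 0.0]]
score = noisy_neighbor_score(member, toy_loss)
print(score > 0)
```

In an audit, such a score would be compared against a threshold calibrated on samples with known membership; how that calibration is done is outside this sketch.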

arXiv:2309.13793 [pdf, other]

ReMasker: Imputing Tabular Data with Masked Autoencoding

Authors: Tianyu Du, Luca Melis, Ting Wang

Abstract: We present ReMasker, a new method of imputing missing values in tabular data by extending the masked autoencoding framework. Compared with prior work, ReMasker is both simple -- besides the missing values (i.e., naturally masked), we randomly ``re-mask'' another set of values, optimize the autoencoder by reconstructing this re-masked set, and apply the trained model to predict the missing values; and effective -- with extensive evaluation on benchmark datasets, we show that ReMasker performs on par with or outperforms state-of-the-art methods in terms of both imputation fidelity and utility under various missingness settings, while its performance advantage often increases with the ratio of missing data. We further explore theoretical justification for its effectiveness, showing that ReMasker tends to learn missingness-invariant representations of tabular data. Our findings indicate that masked modeling represents a promising direction for further research on tabular data imputation. The code is publicly available.

Submitted 24 September, 2023; originally announced September 2023.
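
The re-masking step described in the abstract above can be sketched as follows. This covers only the construction of the training input and reconstruction targets for one row; the function name `remask`, the `ratio` parameter, and the use of `None` for missing values are assumptions for illustration, and the autoencoder itself is omitted.

```python
import random


def remask(row, ratio=0.5, seed=0):
    """Hide a random subset of the observed values in a row.

    row:   list of values, with None marking naturally missing entries.
    ratio: fraction of the observed entries to re-mask (hypothetical knob).
    Returns (visible, targets): the row as the encoder would see it, and a
    dict mapping re-masked positions to the true values to reconstruct.
    """
    rng = random.Random(seed)
    observed = [i for i, v in enumerate(row) if v is not None]
    k = max(1, int(len(observed) * ratio)) if observed else 0
    remasked = set(rng.sample(observed, k))
    # Naturally missing and re-masked positions are both hidden from the input.
    visible = [None if (v is None or i in remasked) else v for i, v in enumerate(row)]
    # Only re-masked positions have known values to serve as training targets.
    targets = {i: row[i] for i in remasked}
    return visible, targets


row = [1.0, None, 3.0, 4.0, None, 6.0]
visible, targets = remask(row, ratio=0.5)
print(visible, targets)
```

The naturally missing entries never appear in `targets`, since their true values are unknown; a trained model would predict them at imputation time.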

arXiv:2306.13434 [pdf, other]

doi 10.1051/0004-6361/202346459

Self-consistent equilibrium models of prominence thin threads heated by Alfvén waves propagating from the photosphere

Authors: Llorenç Melis, Roberto Soler, Jaume Terradas

Abstract: The fine structure of solar prominences is made by thin threads that outline the magnetic field lines. Observations show that transverse waves of Alfvénic nature are ubiquitous in prominence threads. These waves are driven at the photosphere and propagate to prominences suspended in the corona. Heating due to Alfvén wave dissipation could be a relevant mechanism in the cool and partially ionized prominence plasma. We explore the construction of 1D equilibrium models of prominence thin threads that satisfy energy balance between radiative losses, thermal conduction, and Alfvén wave heating. We assume the presence of a broadband driver at the photosphere that launches Alfvén waves towards the prominence. An iterative method is implemented, in which the energy balance equation and the Alfvén wave equation are consecutively solved. From the energy balance equation and considering no wave heating initially, we compute the equilibrium profiles along the thread of the temperature, density, and ionisation fraction. We use the Alfvén wave equation to compute the wave heating rate, which is then put back in the energy balance equation to obtain new equilibrium profiles. The process is repeated until convergence to a self-consistent thread model heated by Alfvén waves is achieved. We have obtained equilibrium models composed of a cold and dense thread, an extremely thin PCTR, and an extended coronal region. The length of the cold thread decreases with the temperature at the prominence core and increases with the Alfvén wave energy flux. Equilibrium models are not possible for sufficiently large wave energy fluxes when the wave heating rate inside the cold thread becomes larger than radiative losses. The maximum value of the wave energy flux that allows an equilibrium depends on the prominence core temperature. This constrains the existence of equilibria in realistic conditions.

Submitted 23 June, 2023; originally announced June 2023.

Comments: 12 pages, 11 figures

Journal ref: A&A 676, A25 (2023)

arXiv:2306.05275 [pdf, ps, other]

Federated Linear Contextual Bandits with User-level Differential Privacy

Authors: Ruiquan Huang, Huanyu Zhang, Luca Melis, Milan Shen, Meisam Hajzinia, Jing Yang

Abstract: This paper studies federated linear contextual bandits under the notion of user-level differential privacy (DP). We first introduce a unified federated bandits framework that can accommodate various definitions of DP in the sequential decision-making setting. We then formally introduce user-level central DP (CDP) and local DP (LDP) in the federated bandits framework, and investigate the fundamental trade-offs between the learning regrets and the corresponding DP guarantees in a federated linear contextual bandits model. For CDP, we propose a federated algorithm termed as $\texttt{ROBIN}$ and show that it is near-optimal in terms of the number of clients $M$ and the privacy budget $\varepsilon$ by deriving nearly-matching upper and lower regret bounds when user-level DP is satisfied. For LDP, we obtain several lower bounds, indicating that learning under user-level $(\varepsilon,\delta)$-LDP must suffer a regret blow-up factor at least $\min\{1/\varepsilon,M\}$ or $\min\{1/\sqrt{\varepsilon},\sqrt{M}\}$ under different conditions.

Submitted 9 June, 2023; v1 submitted 8 June, 2023; originally announced June 2023.

Comments: Accepted by ICML 2023

arXiv:2305.12997 [pdf, other]

Evaluating Privacy Leakage in Split Learning

Authors: Xinchi Qiu, Ilias Leontiadis, Luca Melis, Alex Sablayrolles, Pierre Stock

Abstract: Privacy-Preserving machine learning (PPML) can help us train and deploy models that utilize private information. In particular, on-device machine learning allows us to avoid sharing raw data with a third-party server during inference. On-device models are typically less accurate when compared to their server counterparts due to the fact that (1) they typically only rely on a small set of on-device features and (2) they need to be small enough to run efficiently on end-user devices. Split Learning (SL) is a promising approach that can overcome these limitations. In SL, a large machine learning model is divided into two parts, with the bigger part residing on the server side and a smaller part executing on-device, aiming to incorporate the private features. However, end-to-end training of such models requires exchanging gradients at the cut layer, which might encode private features or labels. In this paper, we provide insights into potential privacy risks associated with SL. Furthermore, we also investigate the effectiveness of various mitigation strategies. Our results indicate that the gradients significantly improve the attackers' effectiveness in all tested datasets reaching almost perfect reconstruction accuracy for some features. However, a small amount of differential privacy (DP) can effectively mitigate this risk without causing significant training degradation.

Submitted 19 January, 2024; v1 submitted 22 May, 2023; originally announced May 2023.

Comments: 10 pages

arXiv:2304.12667 [pdf, other]

Disagreement amongst counterfactual explanations: How transparency can be deceptive

Authors: Dieter Brughmans, Lissa Melis, David Martens

Abstract: Counterfactual explanations are increasingly used as an Explainable Artificial Intelligence (XAI) technique to provide stakeholders of complex machine learning algorithms with explanations for data-driven decisions. The popularity of counterfactual explanations resulted in a boom in the algorithms generating them. However, not every algorithm creates uniform explanations for the same instance. Even though in some contexts multiple possible explanations are beneficial, there are circumstances where diversity amongst counterfactual explanations results in a potential disagreement problem among stakeholders. Ethical issues arise when, for example, malicious agents use this diversity to fairwash an unfair machine learning model by hiding sensitive features. As legislators worldwide tend to start including the right to explanations for data-driven, high-stakes decisions in their policies, these ethical issues should be understood and addressed. Our literature review on the disagreement problem in XAI reveals that this problem has never been empirically assessed for counterfactual explanations. Therefore, in this work, we conduct a large-scale empirical analysis, on 40 datasets, using 12 explanation-generating methods, for two black-box models, yielding over 192.0000 explanations. Our study finds alarmingly high disagreement levels between the methods tested. A malicious user is able to both exclude and include desired features when multiple counterfactual explanations are available. This disagreement seems to be driven mainly by the dataset characteristics and the type of counterfactual algorithm. XAI centers on the transparency of algorithmic decision-making, but our analysis advocates for transparency about this self-proclaimed transparency.

Submitted 25 April, 2023; originally announced April 2023.

arXiv:2206.03852 [pdf, other]

FEL: High Capacity Learning for Recommendation and Ranking via Federated Ensemble Learning

Authors: Meisam Hejazinia, Dzmitry Huba, Ilias Leontiadis, Kiwan Maeng, Mani Malek, Luca Melis, Ilya Mironov, Milad Nasr, Kaikai Wang, Carole-Jean Wu

Abstract: Federated learning (FL) has emerged as an effective approach to address consumer privacy needs. FL has been successfully applied to certain machine learning tasks, such as training smart keyboard models and keyword spotting. Despite FL's initial success, many important deep learning use cases, such as ranking and recommendation tasks, have been limited from on-device learning. One of the key challenges faced by practical FL adoption for DL-based ranking and recommendation is the prohibitive resource requirements that cannot be satisfied by modern mobile systems. We propose Federated Ensemble Learning (FEL) as a solution to tackle the large memory requirement of deep learning ranking and recommendation tasks. FEL enables large-scale ranking and recommendation model training on-device by simultaneously training multiple model versions on disjoint clusters of client devices. FEL integrates the trained sub-models via an over-arch layer into an ensemble model that is hosted on the server. Our experiments demonstrate that FEL leads to 0.43-2.31% model quality improvement over traditional on-device federated learning - a significant improvement for ranking and recommendation system use cases.

Submitted 7 June, 2022; originally announced June 2022.

arXiv:2206.02633 [pdf, other]

Towards Fair Federated Recommendation Learning: Characterizing the Inter-Dependence of System and Data Heterogeneity

Authors: Kiwan Maeng, Haiyu Lu, Luca Melis, John Nguyen, Mike Rabbat, Carole-Jean Wu

Abstract: Federated learning (FL) is an effective mechanism for data privacy in recommender systems by running machine learning model training on-device. While prior FL optimizations tackled the data and system heterogeneity challenges faced by FL, they assume the two are independent of each other. This fundamental assumption is not reflective of real-world, large-scale recommender systems -- data and system heterogeneity are tightly intertwined. This paper takes a data-driven approach to show the inter-dependence of data and system heterogeneity in real-world data and quantifies its impact on the overall model quality and fairness. We design a framework, RF^2, to model the inter-dependence and evaluate its impact on state-of-the-art model optimization techniques for federated recommendation tasks. We demonstrate that the impact on fairness can be severe under realistic heterogeneity scenarios, by up to 15.8--41x compared to a simple setup assumed in most (if not all) prior work. It means when realistic system-induced data heterogeneity is not properly modeled, the fairness impact of an optimization can be downplayed by up to 41x. The result shows that modeling realistic system-induced data heterogeneity is essential to achieving fair federated recommendation learning. We plan to open-source RF^2 to enable future design and evaluation of FL innovations.

Submitted 30 May, 2022; originally announced June 2022.

arXiv:2103.16599 [pdf, ps, other]

doi 10.1051/0004-6361/202140523

Alfven wave heating in partially ionized thin threads of solar prominences

Authors: Llorenc Melis, Roberto Soler, Jose Luis Ballester

Abstract: There is observational evidence of the presence of small-amplitude transverse magnetohydrodynamic (MHD) waves with a wide range of frequencies in the threads of solar prominences. It is believed that the waves are driven at the photosphere and propagate along the magnetic field lines up to prominences suspended in the corona. The dissipation of MHD wave energy in the partially ionized prominence plasma is a heating mechanism whose relevance needs to be explored. Here we consider a simple 1D model for a non-uniform thin thread and investigate the heating associated with dissipation of Alfven waves. The model assumes an ad hoc density profile and a uniform pressure, while the temperature and ionization degree are self-consistently computed considering either LTE or non-LTE approximations for the hydrogen ionization. A broadband driver for Alfven waves is placed at one end of the magnetic field line, representing photospheric excitation. The Alfvenic perturbations along the thread are obtained by solving the linearized MHD equations for a partially ionized plasma in the single-fluid approximation. We find that wave heating in the partially ionized part of the thread is significant enough to compensate for energy losses due to radiative cooling. A greater amount of heating is found in the LTE case because the ionization degree for core prominence temperatures is lower than that in the non-LTE approximation. This results in a greater level of dissipation due to ambipolar diffusion in the LTE case. Conversely, in the hot coronal part of the model, the plasma is fully ionized and wave heating is negligible. The results of this simple model suggest that MHD wave heating can be relevant for the energy balance in prominences. Further studies based on more elaborate models are required.

Submitted 30 March, 2021; originally announced March 2021.

Journal ref: A&A 650, A45 (2021)

arXiv:2103.06641 [pdf, other]

Differentially Private Query Release Through Adaptive Projection

Authors: Sergul Aydore, William Brown, Michael Kearns, Krishnaram Kenthapadi, Luca Melis, Aaron Roth, Ankit Siva

Abstract: We propose, implement, and evaluate a new algorithm for releasing answers to very large numbers of statistical queries like $k$-way marginals, subject to differential privacy. Our algorithm makes adaptive use of a continuous relaxation of the Projection Mechanism, which answers queries on the private dataset using simple perturbation, and then attempts to find the synthetic dataset that most closely matches the noisy answers. We use a continuous relaxation of the synthetic dataset domain which makes the projection loss differentiable, and allows us to use efficient ML optimization techniques and tooling. Rather than answering all queries up front, we make judicious use of our privacy budget by iteratively and adaptively finding queries for which our (relaxed) synthetic data has high error, and then repeating the projection. We perform extensive experimental evaluations across a range of parameters and datasets, and find that our method outperforms existing algorithms in many cases, especially when the privacy budget is small or the query class is large.

Submitted 23 June, 2021; v1 submitted 11 March, 2021; originally announced March 2021.

arXiv:2102.12002 [pdf, other]

Adversarial Robustness with Non-uniform Perturbations

Authors: Ecenaz Erdemir, Jeffrey Bickford, Luca Melis, Sergul Aydore

Abstract: Robustness of machine learning models is critical for security-related applications, where real-world adversaries are uniquely focused on evading neural network based detectors. Prior work mainly focuses on crafting adversarial examples (AEs) with small uniform norm-bounded perturbations across features to maintain the requirement of imperceptibility. However, uniform perturbations do not result in realistic AEs in domains such as malware, finance, and social networks. For these types of applications, features typically have some semantically meaningful dependencies. The key idea of our proposed approach is to enable non-uniform perturbations that can adequately represent these feature dependencies during adversarial training. We propose using characteristics of the empirical data distribution, both on correlations between the features and the importance of the features themselves. Using experimental datasets for malware classification, credit risk prediction, and spam detection, we show that our approach is more robust to real-world attacks. Finally, we present robustness certification utilizing non-uniform perturbation bounds, and show that non-uniform bounds achieve better certification.

Submitted 29 October, 2021; v1 submitted 23 February, 2021; originally announced February 2021.

Comments: Accepted to NeurIPS 2021

arXiv:1810.02649 [pdf, other]

On Collaborative Predictive Blacklisting

Authors: Luca Melis, Apostolos Pyrgelis, Emiliano De Cristofaro

Abstract: Collaborative predictive blacklisting (CPB) allows forecasting future attack sources based on logs and alerts contributed by multiple organizations. Unfortunately, however, research on CPB has only focused on increasing the number of predicted attacks but has not considered the impact on false positives and false negatives. Moreover, sharing alerts is often hindered by confidentiality, trust, and liability issues, which motivates the need for privacy-preserving approaches to the problem. In this paper, we present a measurement study of state-of-the-art CPB techniques, aiming to shed light on the actual impact of collaboration. To this end, we reproduce and measure two systems: a non privacy-friendly one that uses a trusted coordinating party with access to all alerts (Soldo et al., 2010) and a peer-to-peer one using privacy-preserving data sharing (Freudiger et al., 2015). We show that, while collaboration boosts the number of predicted attacks, it also yields high false positives, ultimately leading to poor accuracy. This motivates us to present a hybrid approach, using a semi-trusted central entity, aiming to increase utility from collaboration while, at the same time, limiting information disclosure and false positives. This leads to a better trade-off of true and false positive rates, while at the same time addressing privacy concerns.

Submitted 5 October, 2018; originally announced October 2018.

Comments: A preliminary version of this paper appears in ACM SIGCOMM's Computer Communication Review (Volume 48 Issue 5, October 2018). This is the full version

arXiv:1805.04049 [pdf, other]

Exploiting Unintended Feature Leakage in Collaborative Learning

Authors: Luca Melis, Congzheng Song, Emiliano De Cristofaro, Vitaly Shmatikov

Abstract: Collaborative machine learning and related techniques such as federated learning allow multiple participants, each with his own training dataset, to build a joint model by training locally and periodically exchanging model updates. We demonstrate that these updates leak unintended information about participants' training data and develop passive and active inference attacks to exploit this leakage. First, we show that an adversarial participant can infer the presence of exact data points -- for example, specific locations -- in others' training data (i.e., membership inference). Then, we show how this adversary can infer properties that hold only for a subset of the training data and are independent of the properties that the joint model aims to capture. For example, he can infer when a specific person first appears in the photos used to train a binary gender classifier. We evaluate our attacks on a variety of tasks, datasets, and learning configurations, analyze their limitations, and discuss possible defenses.

Submitted 1 November, 2018; v1 submitted 10 May, 2018; originally announced May 2018.

Comments: Proceedings of 40th IEEE Symposium on Security & Privacy (S&P 2019)

arXiv:1709.04514 [pdf, other]

Differentially Private Mixture of Generative Neural Networks

Authors: Gergely Acs, Luca Melis, Claude Castelluccia, Emiliano De Cristofaro

Abstract: Generative models are used in a wide range of applications building on large amounts of contextually rich information. Due to possible privacy violations of the individuals whose data is used to train these models, however, publishing or sharing generative models is not always viable. In this paper, we present a novel technique for privately releasing generative models and entire high-dimensional datasets produced by these models. We model the generator distribution of the training data with a mixture of $k$ generative neural networks. These are trained together and collectively learn the generator distribution of a dataset. Data is divided into $k$ clusters, using a novel differentially private kernel $k$-means, then each cluster is given to separate generative neural networks, such as Restricted Boltzmann Machines or Variational Autoencoders, which are trained only on their own cluster using differentially private gradient descent. We evaluate our approach using the MNIST dataset, as well as call detail records and transit datasets, showing that it produces realistic synthetic samples, which can also be used to accurately compute an arbitrary number of counting queries.

Submitted 13 July, 2018; v1 submitted 13 September, 2017; originally announced September 2017.

Comments: A shorter version of this paper appeared at the 17th IEEE International Conference on Data Mining (ICDM 2017). This is the full version, published in IEEE Transactions on Knowledge and Data Engineering (TKDE)

arXiv:1705.07663 [pdf, other]

LOGAN: Membership Inference Attacks Against Generative Models

Authors: Jamie Hayes, Luca Melis, George Danezis, Emiliano De Cristofaro

Abstract: Generative models estimate the underlying distribution of a dataset to generate realistic samples according to that distribution. In this paper, we present the first membership inference attacks against generative models: given a data point, the adversary determines whether or not it was used to train the model. Our attacks leverage Generative Adversarial Networks (GANs), which combine a discriminative and a generative model, to detect overfitting and recognize inputs that were part of training datasets, using the discriminator's capacity to learn statistical differences in distributions. We present attacks based on both white-box and black-box access to the target model, against several state-of-the-art generative models, over datasets of complex representations of faces (LFW), objects (CIFAR-10), and medical images (Diabetic Retinopathy). We also discuss the sensitivity of the attacks to different training parameters, and their robustness against mitigation strategies, finding that defenses are either ineffective or lead to significantly worse performances of the generative models in terms of training stability and/or sample quality.

Submitted 21 August, 2018; v1 submitted 22 May, 2017; originally announced May 2017.

Journal ref: Proceedings on Privacy Enhancing Technologies (PoPETs), Vol. 2019, Issue 1

arXiv:1605.03772 [pdf, other]

SplitBox: Toward Efficient Private Network Function Virtualization

Authors: Hassan Jameel Asghar, Luca Melis, Cyril Soldani, Emiliano De Cristofaro, Mohamed Ali Kaafar, Laurent Mathy

Abstract: This paper presents SplitBox, a scalable system for privately processing network functions that are outsourced as software processes to the cloud. Specifically, providers processing the network functions do not learn the network policies instructing how the functions are to be processed. We first propose an abstract model of a generic network function based on match-action pairs, assuming that this is processed in a distributed manner by multiple honest-but-curious providers. Then, we introduce our SplitBox system for private network function virtualization and present a proof-of-concept implementation on FastClick -- an extension of the Click modular router -- using a firewall as a use case. Our experimental results show that SplitBox achieves a throughput of over 2 Gbps with 1 kB-sized packets on average, traversing up to 60 firewall rules.

Submitted 12 May, 2016; originally announced May 2016.

Comments: An earlier version of this paper appears in the Proceedings of the ACM SIGCOMM Workshop on Hot Topics in Middleboxes and Network Function Virtualization (HotMiddleBox 2016). This is the full version

arXiv:1601.06454 [pdf, other]

doi 10.1145/2876019.2876021

Private Processing of Outsourced Network Functions: Feasibility and Constructions

Authors: Luca Melis, Hassan Jameel Asghar, Emiliano De Cristofaro, Mohamed Ali Kaafar

Abstract: Aiming to reduce the cost and complexity of maintaining networking infrastructures, organizations are increasingly outsourcing their network functions (e.g., firewalls, traffic shapers and intrusion detection systems) to the cloud, and a number of industrial players have started to offer network function virtualization (NFV)-based solutions. Alas, outsourcing network functions in its current setting implies that sensitive network policies, such as firewall rules, are revealed to the cloud provider. In this paper, we investigate the use of cryptographic primitives for processing outsourced network functions, so that the provider does not learn any sensitive information. More specifically, we present a cryptographic treatment of privacy-preserving outsourcing of network functions, introducing security definitions as well as an abstract model of generic network functions, and then propose a few instantiations using partial homomorphic encryption and public-key encryption with keyword search. We include a proof-of-concept implementation of our constructions and show that network functions can be privately processed by an untrusted cloud provider in a few milliseconds.

Submitted 24 January, 2016; originally announced January 2016.

Comments: A preliminary version of this paper appears in the 1st ACM International Workshop on Security in Software Defined Networks & Network Function Virtualization. This is the full version

arXiv:1512.04114

Building and Measuring Privacy-Preserving Predictive Blacklists

Authors: Luca Melis, Apostolos Pyrgelis, Emiliano De Cristofaro

Abstract: (Withdrawn) Collaborative security initiatives are increasingly often advocated to improve timeliness and effectiveness of threat mitigation. Among these, collaborative predictive blacklisting (CPB) aims to forecast attack sources based on alerts contributed by multiple organizations that might be targeted in similar ways. Alas, CPB proposals thus far have only focused on improving hit counts, but overlooked the impact of collaboration on false positives and false negatives. Moreover, sharing threat intelligence often prompts important privacy, confidentiality, and liability issues. In this paper, we first provide a comprehensive measurement analysis of two state-of-the-art CPB systems: one that uses a trusted central party to collect alerts [Soldo et al., Infocom'10] and a peer-to-peer one relying on controlled data sharing [Freudiger et al., DIMVA'15], studying the impact of collaboration on both correct and incorrect predictions. Then, we present a novel privacy-friendly approach that significantly improves over previous work, achieving a better balance of true and false positive rates, while minimizing information disclosure. Finally, we present an extension that allows our system to scale to very large numbers of organizations.

Submitted 7 October, 2018; v1 submitted 13 December, 2015; originally announced December 2015.

Comments: Obsolete paper. For more up-to-date work on collaborative predictive blacklisting, see arXiv:1810.02649

arXiv:1508.06110 [pdf, other]

Efficient Private Statistics with Succinct Sketches

Authors: Luca Melis, George Danezis, Emiliano De Cristofaro

Abstract: Large-scale collection of contextual information is often essential in order to gather statistics, train machine learning models, and extract knowledge from data. The ability to do so in a privacy-preserving way -- i.e., without collecting fine-grained user data -- enables a number of additional computational scenarios that would be hard, or outright impossible, to realize without strong privacy guarantees. In this paper, we present the design and implementation of practical techniques for privately gathering statistics from large data streams. We build on efficient cryptographic protocols for private aggregation and on data structures for succinct data representation, namely, Count-Min Sketch and Count Sketch. These allow us to reduce the communication and computation complexity incurred by each data source (e.g., end-users) from linear to logarithmic in the size of their input, while introducing a parametrized upper-bounded error that does not compromise the quality of the statistics. We then show how to use our techniques, efficiently, to instantiate real-world privacy-friendly systems, supporting recommendations for media streaming services, prediction of user locations, and computation of median statistics for Tor hidden services.

Submitted 6 January, 2016; v1 submitted 25 August, 2015; originally announced August 2015.

Comments: To appear in NDSS 2016

Showing 1–19 of 19 results for author: Melis, L