-
FairSISA: Ensemble Post-Processing to Improve Fairness of Unlearning in LLMs
Authors:
Swanand Ravindra Kadhe,
Anisa Halimi,
Ambrish Rawat,
Nathalie Baracaldo
Abstract:
Training large language models (LLMs) is a costly endeavour in terms of time and computational resources. The large amount of training data used during the unsupervised pre-training phase makes it difficult to verify all data and, unfortunately, undesirable data may be ingested during training. Re-training from scratch is impractical and has led to the creation of the 'unlearning' discipline where models are modified to "unlearn" undesirable information without retraining. However, any modification can alter the behaviour of LLMs, especially on key dimensions such as fairness. This is the first work that examines this interplay between unlearning and fairness for LLMs. In particular, we focus on a popular unlearning framework known as SISA [Bourtoule et al., 2021], which creates an ensemble of models trained on disjoint shards. We evaluate the performance-fairness trade-off for SISA, and empirically demonstrate that SISA can indeed reduce fairness in LLMs. To remedy this, we propose post-processing bias mitigation techniques for ensemble models produced by SISA. We adapt the post-processing fairness improvement technique from [Hardt et al., 2016] to design three methods that can handle model ensembles, and prove that one of the methods is an optimal fair predictor for an ensemble of models. Through experimental results, we demonstrate the efficacy of our post-processing framework called 'FairSISA'.
Submitted 12 December, 2023;
originally announced December 2023.
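The FairSISA abstract above describes SISA as an ensemble of models trained on disjoint shards. The following is a minimal sketch of that idea only, not the authors' implementation: the 1-D nearest-centroid learner, the round-robin sharding, the majority vote, and all function names are assumptions made for illustration, and the slicing step of SISA is omitted.

```python
from statistics import mean

def train(shard):
    # Toy learner (assumption): 1-D nearest-centroid classifier.
    # Each shard must contain at least one point of each class.
    c0 = mean(x for x, y in shard if y == 0)
    c1 = mean(x for x, y in shard if y == 1)
    return (c0, c1)

def predict_one(model, x):
    # Predict the class whose centroid is closer to x.
    c0, c1 = model
    return 0 if abs(x - c0) <= abs(x - c1) else 1

def make_shards(data, k):
    # Disjoint shards: point i goes to shard i % k.
    return [data[i::k] for i in range(k)]

def ensemble_predict(models, x):
    # Aggregate the per-shard models by majority vote.
    votes = sum(predict_one(m, x) for m in models)
    return 1 if 2 * votes > len(models) else 0

def unlearn(shards, models, point):
    # Remove `point` and retrain only the shard that contained it.
    for i, shard in enumerate(shards):
        if point in shard:
            shards[i] = [p for p in shard if p != point]
            models[i] = train(shards[i])
            return i
    raise ValueError("point not found in any shard")
```

Because each model sees only its own shard, deleting a training point costs one shard-level retraining rather than a full retraining. The ensemble output is the place where a post-processing step of the kind the abstract proposes would act.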
-
AUTOLYCUS: Exploiting Explainable AI (XAI) for Model Extraction Attacks against White-Box Models
Authors:
Abdullah Caglar Oksuz,
Anisa Halimi,
Erman Ayday
Abstract:
Explainable Artificial Intelligence (XAI) encompasses a range of techniques and procedures aimed at elucidating the decision-making processes of AI models. While XAI is valuable in understanding the reasoning behind AI models, the data used for such revelations poses potential security and privacy vulnerabilities. Existing literature has identified privacy risks targeting machine learning models, including membership inference, model inversion, and model extraction attacks. Depending on the settings and parties involved, such attacks may target either the model itself or the training data used to create the model.
We have identified that tools providing XAI can particularly increase vulnerability to model extraction attacks, which can be a significant issue when the owner of an AI model prefers to provide only black-box access rather than sharing the model parameters and architecture with other parties. To explore this privacy risk, we propose AUTOLYCUS, a model extraction attack that leverages the explanations provided by popular explainable AI tools. We particularly focus on white-box machine learning (ML) models such as decision trees and logistic regression models.
We have evaluated the performance of AUTOLYCUS on 5 machine learning datasets, in terms of the surrogate model's accuracy and its similarity to the target model. We observe that the proposed attack is highly effective; it requires up to 60x fewer queries to the target model compared to the state-of-the-art attack, while providing comparable accuracy and similarity. We first validate the performance of the proposed algorithm on decision trees, and then show its performance on logistic regression models as an indicator that the proposed algorithm performs well on white-box ML models in general. Finally, we show that the existing countermeasures remain ineffective for the proposed attack.
Submitted 6 May, 2023; v1 submitted 4 February, 2023;
originally announced February 2023.
-
Federated Unlearning: How to Efficiently Erase a Client in FL?
Authors:
Anisa Halimi,
Swanand Kadhe,
Ambrish Rawat,
Nathalie Baracaldo
Abstract:
With privacy legislation empowering users with the right to be forgotten, it has become essential to make a model amenable to forgetting some of its training data. However, existing unlearning methods in the machine learning context cannot be directly applied in distributed settings like federated learning due to the differences in learning protocol and the presence of multiple actors. In this paper, we tackle the problem of federated unlearning for the case of erasing a client by removing the influence of their entire local data from the trained global model. To erase a client, we propose to first perform local unlearning at the client to be erased, and then use the locally unlearned model as the initialization to run very few rounds of federated learning between the server and the remaining clients to obtain the unlearned global model. We empirically evaluate our unlearning method by employing multiple performance measures on three datasets, and demonstrate that our unlearning method achieves performance comparable to the gold standard unlearning method of federated retraining from scratch, while being significantly more efficient. Unlike prior works, our unlearning method requires neither global access to the data used for training nor the history of the parameter updates to be stored by the server or any of the clients.
Submitted 20 October, 2023; v1 submitted 12 July, 2022;
originally announced July 2022.
-
ShareTrace: Contact Tracing with the Actor Model
Authors:
Ryan Tatton,
Erman Ayday,
Youngjin Yoo,
Anisa Halimi
Abstract:
Proximity-based contact tracing relies on mobile-device interaction to estimate the spread of disease. ShareTrace is one such approach that improves the efficacy of tracking disease spread by considering direct and indirect forms of contact. In this work, we utilize the actor model to provide an efficient and scalable formulation of ShareTrace with asynchronous, concurrent message passing on a temporal contact network. We also introduce message reachability, an extension of temporal reachability that accounts for network topology and message-passing semantics. Our evaluation on both synthetic and real-world contact networks indicates that correct parameter values optimize for algorithmic accuracy and efficiency. In addition, we demonstrate that message reachability can accurately estimate the risk a user poses to their contacts.
Submitted 18 September, 2022; v1 submitted 23 March, 2022;
originally announced March 2022.
-
Facilitating Federated Genomic Data Analysis by Identifying Record Correlations while Ensuring Privacy
Authors:
Leonard Dervishi,
Xinyue Wang,
Wentao Li,
Anisa Halimi,
Jaideep Vaidya,
Xiaoqian Jiang,
Erman Ayday
Abstract:
With the reduction of sequencing costs and the pervasiveness of computing devices, genomic data collection is continually growing. However, data collection is highly fragmented and the data is still siloed across different repositories. Analyzing all of this data would be transformative for genomics research. However, the data is sensitive, and therefore cannot be easily centralized. Furthermore, there may be correlations in the data, which, if not detected, can impact the analysis. In this paper, we take the first step towards identifying correlated records across multiple data repositories in a privacy-preserving manner. The proposed framework, based on random shuffling, synthetic record generation, and local differential privacy, allows a trade-off between accuracy and computational efficiency. An extensive evaluation on real genomic data from the OpenSNP dataset shows that the proposed solution is efficient and effective.
Submitted 10 March, 2022;
originally announced March 2022.
-
A Bayesian Based Deep Unrolling Algorithm for Single-Photon Lidar Systems
Authors:
Jakeoung Koo,
Abderrahim Halimi,
Stephen McLaughlin
Abstract:
Deploying 3D single-photon Lidar imaging in real world applications faces multiple challenges including imaging in high noise environments. Several algorithms have been proposed to address these issues based on statistical or learning-based frameworks. Statistical methods provide rich information about the inferred parameters but are limited by the assumed model correlation structures, while deep learning methods show state-of-the-art performance but limited inference guarantees, preventing their extended use in critical applications. This paper unrolls a statistical Bayesian algorithm into a new deep learning architecture for robust image reconstruction from single-photon Lidar data, i.e., the algorithm's iterative steps are converted into neural network layers. The resulting algorithm benefits from the advantages of both statistical and learning based frameworks, providing best estimates with improved network interpretability. Compared to existing learning-based solutions, the proposed architecture requires a reduced number of trainable parameters, is more robust to noise and mismodelling effects, and provides richer information about the estimates including uncertainty measures. Results on synthetic and real data show competitive results regarding the quality of the inference and computational complexity when compared to state-of-the-art algorithms.
Submitted 26 January, 2022;
originally announced January 2022.
-
Real-time, low-cost multi-person 3D pose estimation
Authors:
Alice Ruget,
Max Tyler,
Germán Mora Martín,
Stirling Scholes,
Feng Zhu,
Istvan Gyongy,
Brent Hearn,
Steve McLaughlin,
Abderrahim Halimi,
Jonathan Leach
Abstract:
The process of tracking human anatomy in computer vision is referred to as pose estimation, and it is used in fields ranging from gaming to surveillance. Three-dimensional pose estimation traditionally requires advanced equipment, such as multiple linked intensity cameras or high-resolution time-of-flight cameras to produce depth images. However, there are applications, e.g. consumer electronics, where significant constraints are placed on the size, power consumption, weight and cost of the usable technology. Here, we demonstrate that computational imaging methods can achieve accurate pose estimation and overcome the apparent limitations of time-of-flight sensors designed for much simpler tasks. The sensor we use is already widely integrated in consumer-grade mobile devices, and despite its low spatial resolution, only 4$\times$4 pixels, our proposed Pixels2Pose system transforms its data into accurate depth maps and 3D pose data of multiple people up to a distance of 3 m from the sensor. We are able to generate depth maps at a resolution of 32$\times$32 and 3D localization of body parts with an error of only $\approx$10 cm at a frame rate of 7 fps. This work opens up promising real-life applications in scenarios that were previously restricted by the advanced hardware requirements and cost of time-of-flight technology.
Submitted 24 August, 2022; v1 submitted 11 October, 2021;
originally announced October 2021.
-
High-speed object detection with a single-photon time-of-flight image sensor
Authors:
Germán Mora-Martín,
Alex Turpin,
Alice Ruget,
Abderrahim Halimi,
Robert Henderson,
Jonathan Leach,
Istvan Gyongy
Abstract:
3D time-of-flight (ToF) imaging is used in a variety of applications such as augmented reality (AR), computer interfaces, robotics and autonomous systems. Single-photon avalanche diodes (SPADs) are one of the enabling technologies providing accurate depth data even over long ranges. By developing SPADs in array format with integrated processing combined with pulsed, flood-type illumination, high-speed 3D capture is possible. However, array sizes tend to be relatively small, limiting the lateral resolution of the resulting depth maps, and, consequently, the information that can be extracted from the image for applications such as object detection. In this paper, we demonstrate that these limitations can be overcome through the use of convolutional neural networks (CNNs) for high-performance object detection. We present outdoor results from a portable SPAD camera system that outputs 16-bin photon timing histograms with 64x32 spatial resolution. The results, obtained with exposure times down to 2 ms (equivalent to 500 FPS) and at signal-to-background ratios (SBR) as low as 0.05, point to the advantages of providing the CNN with full histogram data rather than point clouds alone. Alternatively, a combination of point cloud and active intensity data may be used as input, for a similar level of performance. In either case, the GPU-accelerated processing time is less than 1 ms per frame, leading to an overall latency (image acquisition plus processing) in the millisecond range, making the results relevant for safety-critical computer vision applications which would benefit from faster than human reaction times.
Submitted 28 July, 2021;
originally announced July 2021.
-
Privacy-Preserving and Efficient Verification of the Outcome in Genome-Wide Association Studies
Authors:
Anisa Halimi,
Leonard Dervishi,
Erman Ayday,
Apostolos Pyrgelis,
Juan Ramon Troncoso-Pastoriza,
Jean-Pierre Hubaux,
Xiaoqian Jiang,
Jaideep Vaidya
Abstract:
Providing provenance in scientific workflows is essential for reproducibility and auditability purposes. Workflow systems model and record provenance describing the steps performed to obtain the final results of a computation. In this work, we propose a framework that verifies the correctness of the statistical test results that are conducted by a researcher while protecting individuals' privacy in the researcher's dataset. The researcher publishes the workflow of the conducted study, its output, and associated metadata. They keep the research dataset private while providing, as part of the metadata, a partial noisy dataset (that achieves local differential privacy). To check the correctness of the workflow output, a verifier makes use of the workflow, its metadata, and results of another statistical study (using publicly available datasets) to distinguish between correct statistics and incorrect ones. As a use case, we apply the proposed framework to genome-wide association studies (GWAS), in which the goal is to identify point mutations (variants) that are highly associated with a given phenotype. For evaluation, we use real genomic data and show that the correctness of the workflow output can be verified with high accuracy even when the aggregate statistics of a small number of variants are provided. We also quantify the privacy leakage due to the provided workflow and its associated metadata in the GWAS use case and show that the additional privacy risk due to the provided metadata does not increase the existing privacy risk due to sharing of the research results. Thus, our results show that the workflow output (i.e., research results) can be verified with high confidence in a privacy-preserving way. We believe that this work will be a valuable step towards providing provenance in a privacy-preserving way while providing guarantees to the users about the correctness of the results.
Submitted 7 November, 2022; v1 submitted 21 January, 2021;
originally announced January 2021.
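The abstract above mentions a partial noisy dataset that achieves local differential privacy but does not name the mechanism. As a hedged illustration only, the sketch below uses k-ary randomized response, a standard local differential privacy mechanism; its use here, the genotype domain {0, 1, 2}, and the function name are assumptions, not details taken from the paper.

```python
import math
import random

def k_randomized_response(value, domain, epsilon, rng):
    """Report `value` truthfully with probability e^eps / (e^eps + k - 1),
    otherwise report a uniformly chosen different element of `domain`.
    This satisfies epsilon-local differential privacy for a k-element domain."""
    k = len(domain)
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p_keep:
        return value
    others = [v for v in domain if v != value]
    return rng.choice(others)

# Hypothetical usage: perturb a few genotype values coded as 0, 1, or 2.
rng = random.Random(0)
noisy = [k_randomized_response(g, [0, 1, 2], 1.0, rng) for g in [0, 1, 2, 2, 1]]
```

Smaller values of epsilon give stronger privacy and noisier reports; at epsilon = 0 the report is uniform over the domain and carries no information about the true value.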
-
Robust super-resolution depth imaging via a multi-feature fusion deep network
Authors:
Alice Ruget,
Stephen McLaughlin,
Robert K. Henderson,
Istvan Gyongy,
Abderrahim Halimi,
Jonathan Leach
Abstract:
Three-dimensional imaging plays an important role in imaging applications where it is necessary to record depth. The number of applications that use depth imaging is increasing rapidly, and examples include self-driving autonomous vehicles and auto-focus assist on smartphone cameras. Light detection and ranging (LIDAR) via single-photon sensitive detector (SPAD) arrays is an emerging technology that enables the acquisition of depth images at high frame rates. However, the spatial resolution of this technology is typically low in comparison to the intensity images recorded by conventional cameras. To increase the native resolution of depth images from a SPAD camera, we develop a deep network built specifically to take advantage of the multiple features that can be extracted from a camera's histogram data. The network is designed for a SPAD camera operating in a dual mode such that it captures alternate low resolution depth and high resolution intensity images at high frame rates; thus, the system does not require any additional sensor to provide intensity images. The network then uses the intensity images and multiple features extracted from downsampled histograms to guide the upsampling of the depth. Our network provides significant image resolution enhancement and image denoising across a wide range of signal-to-noise ratios and photon levels. We apply the network to a range of 3D data, demonstrating denoising and a four-fold resolution enhancement of depth.
Submitted 1 February, 2021; v1 submitted 20 November, 2020;
originally announced November 2020.
-
Efficient Quantification of Profile Matching Risk in Social Networks
Authors:
Anisa Halimi,
Erman Ayday
Abstract:
Anonymous data sharing has become more challenging in today's interconnected digital world, especially for individuals who have both anonymous and identified online activities. The most prominent example of such data sharing platforms today is online social networks (OSNs). Many individuals have multiple profiles in different OSNs, including anonymous and identified ones (depending on the nature of the OSN). Here, the privacy threat is profile matching: if an attacker links anonymous profiles of individuals to their real identities, it can obtain privacy-sensitive information which may have serious consequences, such as discrimination or blackmail. Therefore, it is very important to quantify and show to the OSN users the extent of this privacy risk. Existing attempts to model profile matching in OSNs are inadequate and computationally inefficient for real-time risk quantification. Thus, in this work, we develop algorithms to efficiently model and quantify profile matching attacks in OSNs as a step towards real-time privacy risk quantification. For this, we model the profile matching problem using a graph and develop a belief propagation (BP)-based algorithm to solve this problem in a significantly more efficient and accurate way compared to the state-of-the-art. We evaluate the proposed framework on three real-life datasets (including data from four different social networks) and show how users' profiles in different OSNs can be matched efficiently and with high probability. We show that the proposed model generation has linear complexity in terms of the number of user pairs, which is significantly more efficient than the state-of-the-art (which has cubic complexity). Furthermore, it provides accuracy, precision, and recall comparable to the state-of-the-art.
Submitted 7 September, 2020;
originally announced September 2020.
-
Profile Matching Across Online Social Networks
Authors:
Anisa Halimi,
Erman Ayday
Abstract:
In this work, we study the privacy risk due to profile matching across online social networks (OSNs), in which anonymous profiles of OSN users are matched to their real identities using auxiliary information about them. We consider different attributes that are publicly shared by users. Such attributes include both strong identifiers such as user name and weak identifiers such as interest or sentiment variation between different posts of a user in different platforms. We study the effect of using different combinations of these attributes on profile matching in order to show the privacy threat in an extensive way. The proposed framework mainly relies on machine learning techniques and optimization algorithms. We evaluate the proposed framework on three datasets (Twitter - Foursquare, Google+ - Twitter, and Flickr) and show how profiles of the users in different OSNs can be matched with high probability by using the publicly shared attributes and/or the underlying graphical structure of the OSNs. We also show that the proposed framework provides notably higher precision values than state-of-the-art approaches that rely on machine learning techniques. We believe that this work will be a valuable step towards building a tool for OSN users to understand the privacy risks of the information they share publicly.
Submitted 19 August, 2020;
originally announced August 2020.
-
Profile Matching Across Unstructured Online Social Networks: Threats and Countermeasures
Authors:
Anisa Halimi,
Erman Ayday
Abstract:
In this work, we propose a profile matching (or deanonymization) attack for unstructured online social networks (OSNs) in which similarity in graphical structure cannot be used for profile matching. We consider different attributes that are publicly shared by users. Such attributes include both obvious identifiers such as the user name and non-obvious identifiers such as interest similarity or sentiment variation between different posts of a user in different platforms. We study the effect of using different combinations of these attributes on profile matching in order to show the privacy threat in an extensive way. Our proposed framework mainly relies on machine learning techniques and optimization algorithms. We evaluate the proposed framework on two real-life datasets that we constructed. Our results indicate that profiles of the users in different OSNs can be matched with high probability by only using publicly shared attributes and without using the underlying graphical structure of the OSNs. We also propose possible countermeasures to mitigate this threat at the expense of a reduction in the accuracy (or utility) of the attributes shared by the users. We formulate the tradeoff between the privacy and profile utility of the users as an optimization problem and show how slight changes in the profiles of the users would reduce the success of the attack. We believe that this work will be a valuable step towards building a privacy-preserving tool for users against profile matching attacks between OSNs.
Submitted 6 November, 2017;
originally announced November 2017.
-
Correntropy Maximization via ADMM - Application to Robust Hyperspectral Unmixing
Authors:
Fei Zhu,
Abderrahim Halimi,
Paul Honeine,
Badong Chen,
Nanning Zheng
Abstract:
In hyperspectral images, some spectral bands suffer from low signal-to-noise ratio due to noisy acquisition and atmospheric effects, thus requiring robust techniques for the unmixing problem. This paper presents a robust supervised spectral unmixing approach for hyperspectral images. The robustness is achieved by writing the unmixing problem as the maximization of the correntropy criterion subject to the most commonly used constraints. Two unmixing problems are derived: the first problem considers the fully-constrained unmixing, with both the non-negativity and sum-to-one constraints, while the second one deals with the non-negativity and the sparsity promotion of the abundances. The corresponding optimization problems are solved efficiently using an alternating direction method of multipliers (ADMM) approach. Experiments on synthetic and real hyperspectral images validate the performance of the proposed algorithms for different scenarios, demonstrating that the correntropy-based unmixing is robust to outlier bands.
Submitted 4 February, 2016;
originally announced February 2016.
-
Estimating the Intrinsic Dimension of Hyperspectral Images Using an Eigen-Gap Approach
Authors:
A. Halimi,
P. Honeine,
M. Kharouf,
C. Richard,
J. -Y. Tourneret
Abstract:
Linear mixture models are commonly used to represent a hyperspectral datacube as a linear combination of endmember spectra. However, determining the number of endmembers for images embedded in noise is a crucial task. This paper proposes a fully automatic approach for estimating the number of endmembers in hyperspectral images. The estimation is based on recent results of random matrix theory related to the so-called spiked population model. More precisely, we study the gap between successive eigenvalues of the sample covariance matrix constructed from high dimensional noisy samples. The resulting estimation strategy is unsupervised and robust to correlated noise. This strategy is validated on both synthetic and real images. The experimental results are very promising and show the accuracy of this algorithm with respect to state-of-the-art algorithms.
Submitted 22 January, 2015;
originally announced January 2015.
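The eigen-gap abstract above studies the gap between successive eigenvalues of the sample covariance matrix. The sketch below is a deliberately simplified stand-in: it picks the position of the largest gap, whereas the paper derives its decision rule from random matrix theory, which is not reproduced here. The function name and the assumption that the eigenvalues have already been computed are both illustrative.

```python
def estimate_dimension(eigenvalues):
    """Largest-gap heuristic (a simplification, not the paper's test):
    sort eigenvalues in decreasing order and return the number of
    eigenvalues that sit before the largest successive gap."""
    lam = sorted(eigenvalues, reverse=True)
    if len(lam) < 2:
        raise ValueError("need at least two eigenvalues")
    # gaps[i] is the drop from the (i+1)-th to the (i+2)-th largest eigenvalue.
    gaps = [lam[i] - lam[i + 1] for i in range(len(lam) - 1)]
    largest = max(range(len(gaps)), key=gaps.__getitem__)
    return largest + 1

# Hypothetical spectrum: three strong components followed by a noise floor.
estimated = estimate_dimension([10.0, 9.0, 8.0, 0.1, 0.09, 0.08])
```

This heuristic can fail when the signal eigenvalues are themselves widely spread, which is one reason a principled threshold on the gaps is preferable in practice.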