Holistic Survey of Privacy and Fairness in Machine Learning

Authors: Sina Shaham, Arash Hajisafi, Minh K Quan, Dinh C Nguyen, Bhaskar Krishnamachari, Charith Peris, Gabriel Ghinita, Cyrus Shahabi, Pubudu N. Pathirana

Abstract: Privacy and fairness are two crucial pillars of responsible Artificial Intelligence (AI) and trustworthy Machine Learning (ML). Each objective has been independently studied in the literature with the aim of reducing utility loss in achieving them. Despite the significant interest attracted from both academia and industry, there remains an immediate demand for more in-depth research to unravel how these two objectives can be simultaneously integrated into ML models. As opposed to well-accepted trade-offs, i.e., privacy-utility and fairness-utility, the interrelation between privacy and fairness is not well-understood. While some works suggest a trade-off between the two objective functions, there are others that demonstrate the alignment of these functions in certain scenarios. To fill this research gap, we provide a thorough review of privacy and fairness in ML, including supervised, unsupervised, semi-supervised, and reinforcement learning. After examining and consolidating the literature on both objectives, we present a holistic survey on the impact of privacy on fairness, the impact of fairness on privacy, existing architectures, their interaction in application domains, and algorithms that aim to achieve both objectives while minimizing the utility sacrificed. Finally, we identify research challenges in achieving privacy and fairness concurrently in ML, particularly focusing on large language models.

Submitted 28 July, 2023; originally announced July 2023.

arXiv:2307.05717 [pdf, other]

Towards Mobility Data Science (Vision Paper)

Authors: Mohamed Mokbel, Mahmoud Sakr, Li Xiong, Andreas Züfle, Jussara Almeida, Taylor Anderson, Walid Aref, Gennady Andrienko, Natalia Andrienko, Yang Cao, Sanjay Chawla, Reynold Cheng, Panos Chrysanthis, Xiqi Fei, Gabriel Ghinita, Anita Graser, Dimitrios Gunopulos, Christian Jensen, Joon-Seok Kim, Kyoung-Sook Kim, Peer Kröger, John Krumm, Johannes Lauer, Amr Magdy, Mario Nascimento, et al. (23 additional authors not shown)

Abstract: Mobility data captures the locations of moving objects such as humans, animals, and cars. With the availability of GPS-equipped mobile devices and other inexpensive location-tracking technologies, mobility data is collected ubiquitously. In recent years, the use of mobility data has demonstrated significant impact in various domains including traffic management, urban planning, and health sciences. In this paper, we present the emerging domain of mobility data science. Towards a unified approach to mobility data science, we envision a pipeline having the following components: mobility data collection, cleaning, analysis, management, and privacy. For each of these components, we explain how mobility data science differs from general data science, we survey the current state of the art and describe open challenges for the research community in the coming years.

Submitted 7 March, 2024; v1 submitted 21 June, 2023; originally announced July 2023.

Comments: Updated to reflect the major revision for ACM Transactions on Spatial Algorithms and Systems (TSAS). This version reflects the final version accepted by ACM TSAS

arXiv:2302.02306 [pdf, ps, other]

Fair Spatial Indexing: A paradigm for Group Spatial Fairness

Authors: Sina Shaham, Gabriel Ghinita, Cyrus Shahabi

Abstract: Machine learning (ML) is playing an increasing role in decision-making tasks that directly affect individuals, e.g., loan approvals, or job applicant screening. Significant concerns arise that, without special provisions, individuals from under-privileged backgrounds may not get equitable access to services and opportunities. Existing research studies fairness with respect to protected attributes such as gender, race or income, but the impact of location data on fairness has been largely overlooked. With the widespread adoption of mobile apps, geospatial attributes are increasingly used in ML, and their potential to introduce unfair bias is significant, given their high correlation with protected attributes. We propose techniques to mitigate location bias in machine learning. Specifically, we consider the issue of miscalibration when dealing with geospatial attributes. We focus on spatial group fairness and we propose a spatial indexing algorithm that accounts for fairness. Our KD-tree inspired approach significantly improves fairness while maintaining high learning accuracy, as shown by extensive experimental results on real data.

Submitted 5 February, 2023; originally announced February 2023.

arXiv:2301.06238 [pdf, other]

Supporting Secure Dynamic Alert Zones Using Searchable Encryption and Graph Embedding

Authors: Sina Shaham, Gabriel Ghinita, Cyrus Shahabi

Abstract: Location-based alerts have gained increasing popularity in recent years, whether in the context of healthcare (e.g., COVID-19 contact tracing), marketing (e.g., location-based advertising), or public safety. However, serious privacy concerns arise when location data are used in clear in the process. Several solutions employ Searchable Encryption (SE) to achieve secure alerts directly on encrypted locations. While doing so preserves privacy, the performance overhead incurred is high. We focus on a prominent SE technique in the public-key setting -- Hidden Vector Encryption (HVE), and propose a graph embedding technique to encode location data in a way that significantly boosts the performance of processing on ciphertexts. We show that finding the optimal encoding is NP-hard, and provide several heuristics that are fast and obtain significant performance gains. Furthermore, we investigate the more challenging case of dynamic alert zones, where the area of interest changes over time. Our extensive experimental evaluation shows that our solutions can significantly improve computational overhead compared to existing baselines.

Submitted 15 January, 2023; originally announced January 2023.

arXiv:2208.09744 [pdf, other]

A Neural Approach to Spatio-Temporal Data Release with User-Level Differential Privacy

Authors: Ritesh Ahuja, Sepanta Zeighami, Gabriel Ghinita, Cyrus Shahabi

Abstract: Several companies (e.g., Meta, Google) have initiated "data-for-good" projects where aggregate location data are first sanitized and released publicly, which is useful to many applications in transportation, public health (e.g., COVID-19 spread) and urban planning. Differential privacy (DP) is the protection model of choice to ensure the privacy of the individuals who generated the raw location data. However, current solutions fail to preserve data utility when each individual contributes multiple location reports (i.e., under user-level privacy). To offset this limitation, public releases by Meta and Google use high privacy budgets (e.g., $ε$=10-100), resulting in poor privacy. We propose a novel approach to release spatio-temporal data privately and accurately. We employ the pattern recognition power of neural networks, specifically variational auto-encoders (VAE), to reduce the noise introduced by DP mechanisms such that accuracy is increased, while the privacy requirement is still satisfied. Our extensive experimental evaluation on real datasets shows the clear superiority of our approach compared to benchmarks.

Submitted 20 August, 2022; originally announced August 2022.

Comments: SIGMOD 2023

arXiv:2204.01880 [pdf, other]

Models and Mechanisms for Spatial Data Fairness

Authors: Sina Shaham, Gabriel Ghinita, Cyrus Shahabi

Abstract: Fairness in data-driven decision-making studies scenarios where individuals from certain population segments may be unfairly treated when being considered for loan or job applications, access to public resources, or other types of services. In location-based applications, decisions are based on individual whereabouts, which often correlate with sensitive attributes such as race, income, and education. While fairness has received significant attention recently, e.g., in machine learning, there is little focus on achieving fairness when dealing with location data. Due to their characteristics and specific type of processing algorithms, location data pose important fairness challenges. We introduce the concept of spatial data fairness to address the specific challenges of location data and spatial queries. We devise a novel building block to achieve fairness in the form of fair polynomials. Next, we propose two mechanisms based on fair polynomials that achieve individual spatial fairness, corresponding to two common location-based decision-making types: distance-based and zone-based. Extensive experimental results on real data show that the proposed mechanisms achieve spatial fairness without sacrificing utility.

Submitted 17 October, 2022; v1 submitted 4 April, 2022; originally announced April 2022.

arXiv:2202.12342 [pdf, other]

Differentially-Private Publication of Origin-Destination Matrices with Intermediate Stops

Authors: Sina Shaham, Gabriel Ghinita, Cyrus Shahabi

Abstract: Conventional origin-destination (OD) matrices record the count of trips between pairs of start and end locations, and have been extensively used in transportation, traffic planning, etc. More recently, due to use case scenarios such as COVID-19 pandemic spread modeling, it is increasingly important to also record intermediate points along an individual's path, rather than only the trip start and end points. This can be achieved by using a multi-dimensional frequency matrix over a data space partitioning at the desired level of granularity. However, serious privacy constraints occur when releasing OD matrix data, and especially when adding multiple intermediate points, which makes individual trajectories more distinguishable to an attacker. To address this threat, we propose a technique for privacy-preserving publication of multi-dimensional OD matrices that achieves differential privacy (DP), the de-facto standard in private data release. We propose a family of approaches that factor in important data properties such as data density and homogeneity in order to build OD matrices that provide provable protection guarantees while preserving query accuracy. Extensive experiments on real and synthetic datasets show that the proposed approaches clearly outperform existing state-of-the-art.
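As a rough illustration of the multi-dimensional frequency matrix mentioned in this abstract, the sketch below counts trips keyed by (origin, stop, destination). It covers only the plain counting step, with no privacy protection. The zone labels, the function name `od_matrix_with_stops`, and the rule of skipping trips with a different number of points are assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter

def od_matrix_with_stops(trips, num_stops):
    """Sparse multi-dimensional OD matrix: maps (origin, stops..., destination) to a trip count.

    Hypothetical helper: each trip is a sequence of zone labels, and only trips
    with exactly num_stops intermediate points are counted.
    """
    matrix = Counter()
    for trip in trips:
        if len(trip) != num_stops + 2:
            continue  # skip trips that do not match the chosen dimensionality
        matrix[tuple(trip)] += 1
    return matrix

# Hypothetical zone labels from some partitioning of the data space.
trips = [
    ("z1", "z2", "z3"),
    ("z1", "z2", "z3"),
    ("z1", "z4", "z3"),
    ("z2", "z3"),  # no intermediate stop, skipped when num_stops=1
]
od = od_matrix_with_stops(trips, num_stops=1)
```

Each additional intermediate stop adds one dimension to the key, which is why the matrix becomes sparser and individual trajectories become easier to single out.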

Submitted 24 February, 2022; originally announced February 2022.

arXiv:2108.01496 [pdf, other]

A Neural Database for Differentially Private Spatial Range Queries

Authors: Sepanta Zeighami, Ritesh Ahuja, Gabriel Ghinita, Cyrus Shahabi

Abstract: Mobile apps and location-based services generate large amounts of location data that can benefit research on traffic optimization, context-aware notifications and public health (e.g., spread of contagious diseases). To preserve individual privacy, one must first sanitize location data, which is commonly done using the powerful differential privacy (DP) concept. However, existing solutions fall short of properly capturing density patterns and correlations that are intrinsic to spatial data, and as a result yield poor accuracy. We propose a machine-learning based approach for answering statistical queries on location data with DP guarantees. We focus on countering the main source of error that plagues existing approaches (namely, uniformity error), and we design a neural database system that models spatial datasets such that important density and correlation features present in the data are preserved, even when DP-compliant noise is added. We employ a set of neural networks that learn from diverse regions of the dataset and at varying granularities, leading to superior accuracy. We also devise a framework for effective system parameter tuning on top of public data, which helps practitioners set important system parameters without having to expend scarce privacy budget. Extensive experimental results on real datasets with heterogeneous characteristics show that our proposed approach significantly outperforms the state of the art.

Submitted 3 August, 2021; originally announced August 2021.

arXiv:2107.13749 [pdf, other]

HTF: Homogeneous Tree Framework for Differentially-Private Release of Location Data

Authors: Sina Shaham, Gabriel Ghinita, Ritesh Ahuja, John Krumm, Cyrus Shahabi

Abstract: Mobile apps that use location data are pervasive, spanning domains such as transportation, urban planning and healthcare. Important use cases for location data rely on statistical queries, e.g., identifying hotspots where users work and travel. Such queries can be answered efficiently by building histograms. However, precise histograms can expose sensitive details about individual users. Differential privacy (DP) is a mature and widely-adopted protection model, but most approaches for DP-compliant histograms work in a data-independent fashion, leading to poor accuracy. The few proposed data-dependent techniques attempt to adjust histogram partitions based on dataset characteristics, but they do not perform well due to the addition of noise required to achieve DP. We identify density homogeneity as a main factor driving the accuracy of DP-compliant histograms, and we build a data structure that splits the space such that data density is homogeneous within each resulting partition. We show through extensive experiments on large-scale real-world data that the proposed approach achieves superior accuracy compared to existing approaches.
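For context, the data-independent baseline that this abstract contrasts against can be sketched as a fixed uniform grid with Laplace noise added to every cell count. This is not the HTF structure itself. The sketch assumes each individual contributes exactly one point (so one person changes one cell count by 1), coordinates normalized to [0, 1), and hypothetical names `laplace_noise` and `dp_grid_histogram`.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_grid_histogram(points, grid_size, epsilon, rng):
    """Uniform-grid histogram with Laplace noise of scale 1/epsilon per cell.

    Assumption: one point per individual, so the L1 sensitivity of the
    histogram is 1 under add/remove-one-point neighboring datasets.
    """
    counts = [[0] * grid_size for _ in range(grid_size)]
    for x, y in points:
        col = min(int(x * grid_size), grid_size - 1)
        row = min(int(y * grid_size), grid_size - 1)
        counts[row][col] += 1
    scale = 1.0 / epsilon
    return [[c + laplace_noise(scale, rng) for c in row] for row in counts]

rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(1000)]
noisy = dp_grid_histogram(points, grid_size=4, epsilon=1.0, rng=rng)
```

The grid boundaries here never look at the data, which is the property the abstract identifies as a cause of poor accuracy when density varies inside a cell.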

Submitted 29 July, 2021; originally announced July 2021.

arXiv:2105.00618 [pdf, other]

An Efficient and Secure Location-based Alert Protocol using Searchable Encryption and Huffman Codes

Authors: Sina Shaham, Gabriel Ghinita, Cyrus Shahabi

Abstract: Location data are widely used in mobile apps, ranging from location-based recommendations, to social media and navigation. A specific type of interaction is that of location-based alerts, where mobile users subscribe to a service provider (SP) in order to be notified when a certain event occurs nearby. Consider, for instance, the ongoing COVID-19 pandemic, where contact tracing has been singled out as an effective means to control the virus spread. Users wish to be notified if they came in proximity to an infected individual. However, serious privacy concerns arise if the users share their location history with the SP in plaintext. To address privacy, recent work proposed several protocols that can securely implement location-based alerts. The users upload their encrypted locations to the SP, and the evaluation of location predicates is done directly on ciphertexts. When a certain individual is reported as infected, all matching ciphertexts are found (e.g., according to a predicate such as "10 feet proximity to any of the locations visited by the infected patient in the last week"), and the corresponding users notified. However, there are significant performance issues associated with existing protocols. The underlying searchable encryption primitives required to perform the matching on ciphertexts are expensive, and without a proper encoding of locations and search predicates, the performance can degrade a lot. In this paper, we propose a novel method for variable-length location encoding based on Huffman codes. By controlling the length required to represent encrypted locations and the corresponding matching predicates, we are able to significantly speed up performance. We provide a theoretical analysis of the gain achieved by using Huffman codes, and we show through extensive experiments that the improvement compared with fixed-length encoding methods is substantial.
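The variable-length encoding idea can be illustrated with a textbook Huffman construction over grid cells, where cells that are matched more often receive shorter codes. This is a minimal sketch of the coding step only. How the codes feed into the searchable-encryption protocol is not described in the abstract and is not shown here. The cell names and visit counts are hypothetical.

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Return {symbol: bitstring} for a dict mapping symbols to positive frequencies."""
    if len(freqs) == 1:
        return {next(iter(freqs)): "0"}
    tie = count()  # unique tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # Prefix the two lightest subtrees with 0 and 1, then merge them.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

# Hypothetical visit counts per grid cell.
cell_visits = {"cell_A": 50, "cell_B": 25, "cell_C": 15, "cell_D": 10}
codes = huffman_codes(cell_visits)
```

With a fixed-length scheme every one of these four cells would need two bits, whereas here the most visited cell gets a one-bit code at the cost of three bits for the two least visited cells.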

Submitted 2 May, 2021; originally announced May 2021.

arXiv:2004.09005 [pdf, other]

doi 10.1007/s10707-020-00410-1

A Secure Location-based Alert System with Tunable Privacy-Performance Trade-off

Authors: Gabriel Ghinita, Kien Nguyen, Mihai Maruseac, Cyrus Shahabi

Abstract: Monitoring location updates from mobile users has important applications in many areas, ranging from public safety and national security to social networks and advertising. However, sensitive information can be derived from movement patterns, thus protecting the privacy of mobile users is a major concern. Users may only be willing to disclose their locations when some condition is met, for instance in proximity of a disaster area or an event of interest. Currently, such functionality can be achieved using searchable encryption. Such cryptographic primitives provide provable guarantees for privacy, and allow decryption only when the location satisfies some predicate. Nevertheless, they rely on expensive pairing-based cryptography (PBC), of which direct application to the domain of location updates leads to impractical solutions. We propose secure and efficient techniques for private processing of location updates that complement the use of PBC and lead to significant gains in performance by reducing the amount of required pairing operations. We implement two optimizations that further improve performance: materialization of results to expensive mathematical operations, and parallelization. We also propose a heuristic that brings down the computational overhead through enlarging an alert zone by a small factor (given as system parameter), therefore trading off a small and controlled amount of privacy for significant performance gains. Extensive experimental results show that the proposed techniques significantly improve performance compared to the baseline, and reduce the searchable encryption overhead to a level that is practical in a computing environment with reasonable resources, such as the cloud.

Submitted 19 April, 2020; originally announced April 2020.

Comments: 32 pages, GeoInformatica 2020

arXiv:2003.00051 [pdf, ps, other]

Dynamic Skyline Queries on Encrypted Data Using Result Materialization

Authors: Sepanta Zeighami, Gabriel Ghinita, Cyrus Shahabi

Abstract: Skyline computation is an increasingly popular query, with broad applicability in domains such as healthcare, travel and finance. Given the recent trend to outsource databases and query evaluation, and due to the proprietary and sometimes highly sensitive nature of the data (e.g., in healthcare), it is essential to evaluate skylines on encrypted datasets. Several research efforts acknowledged the importance of secure skyline computation, but existing solutions suffer from at least one of the following shortcomings: (i) they only provide ad-hoc security; (ii) they are prohibitively expensive; or (iii) they rely on unrealistic assumptions, such as the presence of multiple non-colluding parties in the protocol. Inspired by solutions for secure nearest-neighbors (NN) computation, we conjecture that the most secure and efficient way to compute skylines is through result materialization. However, this approach is significantly more challenging for skylines than for NN queries. We exhaustively study and provide algorithms for pre-computation of skyline results, and we perform an in-depth theoretical analysis of this process. We show that pre-computing results while minimizing storage overhead is NP-hard, and we provide dynamic programming and greedy heuristics that solve the problem more efficiently, while maintaining storage at reasonable levels. Our algorithms are novel and applicable to plain-text skyline computation, but we focus on the encrypted setting where materialization reduces the cost of skyline computation from hours to seconds. Extensive experiments show that we clearly outperform existing work in terms of performance, and our security analysis proves that we obtain a smaller (and quantifiable) data leakage than competitors.

Submitted 28 February, 2020; originally announced March 2020.

arXiv:1909.00299 [pdf, other]

doi 10.1145/3347146.3359072

A Privacy-Preserving, Accountable and Spam-Resilient Geo-Marketplace

Authors: Kien Nguyen, Gabriel Ghinita, Muhammad Naveed, Cyrus Shahabi

Abstract: Mobile devices with rich features can record videos, traffic parameters or air quality readings along user trajectories. Although such data may be valuable, users are seldom rewarded for collecting them. Emerging digital marketplaces allow owners to advertise their data to interested buyers. We focus on geo-marketplaces, where buyers search data based on geo-tags. Such marketplaces present significant challenges. First, if owners upload data with revealed geo-tags, they expose themselves to serious privacy risks. Second, owners must be accountable for advertised data, and must not be allowed to subsequently alter geo-tags. Third, such a system may be vulnerable to intensive spam activities, where dishonest owners flood the system with fake advertisements. We propose a geo-marketplace that addresses all these concerns. We employ searchable encryption, digital commitments, and blockchain to protect the location privacy of owners while at the same time incorporating accountability and spam-resilience mechanisms. We implement a prototype with two alternative designs that obtain distinct trade-offs between trust assumptions and performance. Our experiments on real location data show that one can achieve the above design goals with practical performance and reasonable financial overhead.

Submitted 30 September, 2019; v1 submitted 31 August, 2019; originally announced September 2019.

Comments: SIGSPATIAL'19, 10 pages

Showing 1–13 of 13 results for author: Ghinita, G