Search | arXiv e-print repository

SoK: Demystifying Privacy Enhancing Technologies Through the Lens of Software Developers

Authors: Maisha Boteju, Thilina Ranbaduge, Dinusha Vatsalan, Nalin Asanka Gamagedara Arachchilage

Abstract: In the absence of data protection measures, software applications lead to privacy breaches, posing threats to end-users and software organisations. Privacy Enhancing Technologies (PETs) are technical measures that protect personal data, thus minimising such privacy breaches. However, for software applications to deliver data protection using PETs, software developers should actively and correctly incorporate PETs into the software they develop. Therefore, to uncover ways to encourage and support developers to embed PETs into software, this Systematic Literature Review (SLR) analyses 39 empirical studies on developers' privacy practices. It reports the usage of six PETs in software application scenarios. Then, it discusses challenges developers face when integrating PETs into software, ranging from intrinsic challenges, such as the unawareness of PETs, to extrinsic challenges, such as the increased development cost. Next, the SLR presents the existing solutions to address these challenges, along with the limitations of the solutions. Further, it outlines future research avenues to better understand PETs from a developer perspective and minimise the challenges developers face when incorporating PETs into software.

Submitted 30 December, 2023; originally announced January 2024.

arXiv:2301.04000 [pdf, other]

Privacy-Preserving Record Linkage for Cardinality Counting

Authors: Nan Wu, Dinusha Vatsalan, Mohamed Ali Kaafar, Sanath Kumar Ramesh

Abstract: Several applications require counting the number of distinct items in the data, which is known as the cardinality counting problem. Example applications include health applications such as counting rare disease patients for adequate awareness and funding, and counting the number of cases of a new disease for outbreak detection, marketing applications such as counting the visibility reached for a new product, and cybersecurity applications such as tracking the number of unique views of social media posts. The data needed for the counting is, however, often personal and sensitive, and needs to be processed using privacy-preserving techniques. The quality of data in different databases, for example typos, errors and variations, poses additional challenges for accurate cardinality estimation. While privacy-preserving cardinality counting has gained much attention in recent times and a few privacy-preserving algorithms have been developed for cardinality estimation, no work has so far been done on privacy-preserving cardinality counting using record linkage techniques with fuzzy matching and provable privacy guarantees. We propose a novel privacy-preserving record linkage algorithm using unsupervised clustering techniques to link and count the cardinality of individuals in multiple datasets without compromising their privacy or identity. In addition, existing Elbow methods to find the optimal number of clusters as the cardinality are far from accurate as they do not take into account the purity and completeness of generated clusters. We propose a novel method to find the optimal number of clusters in unsupervised learning. Our experimental results on real and synthetic datasets are highly promising in terms of significantly smaller error rate of less than 0.1 with a privacy budget ε = 1.0 compared to the state-of-the-art fuzzy matching and clustering method.

Submitted 9 January, 2023; originally announced January 2023.

arXiv:2212.05682 [pdf, other]

doi 10.1007/978-3-319-63962-8_17-2

Privacy-Preserving Record Linkage

Authors: Dinusha Vatsalan, Dimitrios Karapiperis, Vassilios S. Verykios

Abstract: Given several databases containing person-specific data held by different organizations, Privacy-Preserving Record Linkage (PPRL) aims to identify and link records that correspond to the same entity/individual across different databases based on the matching of personal identifying attributes, such as name and address, without revealing the actual values in these attributes due to privacy concerns. This reference work entry defines the PPRL problem, reviews the literature and key findings, and discusses applications and research challenges.

Submitted 11 December, 2022; originally announced December 2022.

Comments: pp. 1-10

Report number: Reference Work Entry - Version 2 (Original entry - https://doi.org/10.1007/978-3-319-63962-8_17-1)

Journal ref: Springer Encyclopedia of Big Data Technologies, 2022

arXiv:2211.02161 [pdf, ps, other]

Privacy-preserving Deep Learning based Record Linkage

Authors: Thilina Ranbaduge, Dinusha Vatsalan, Ming Ding

Abstract: Deep learning-based linkage of records across different databases is becoming increasingly useful in data integration and mining applications to discover new insights from multiple sources of data. However, due to privacy and confidentiality concerns, organisations often are not willing or allowed to share their sensitive data with any external parties, thus making it challenging to build/train deep learning models for record linkage across different organizations' databases. To overcome this limitation, we propose the first deep learning-based multi-party privacy-preserving record linkage (PPRL) protocol that can be used to link sensitive databases held by multiple different organisations. In our approach, each database owner first trains a local deep learning model, which is then uploaded to a secure environment and securely aggregated to create a global model. The global model is then used by a linkage unit to distinguish unlabelled record pairs as matches and non-matches. We utilise differential privacy to achieve provable privacy protection against re-identification attacks. We evaluate the linkage quality and scalability of our approach using several large real-world databases, showing that it can achieve high linkage quality while providing sufficient privacy protection against existing attacks.

Submitted 3 November, 2022; originally announced November 2022.

Comments: 11 pages

arXiv:2208.05264 [pdf, other]

doi 10.1109/TKDE.2022.3198478

Local Differentially Private Fuzzy Counting in Stream Data using Probabilistic Data Structure

Authors: Dinusha Vatsalan, Raghav Bhaskar, Mohamed Ali Kaafar

Abstract: Privacy-preserving estimation of counts of items in streaming data finds applications in several real-world scenarios including word auto-correction and traffic management applications. Recent works of RAPPOR and Apple's count-mean sketch (CMS) algorithm propose privacy preserving mechanisms for count estimation in large volumes of data using probabilistic data structures like counting Bloom filter and CMS. However, these existing methods fall short in providing a sound solution for real-time streaming data applications. In this work, we propose a novel (local) differentially private mechanism that provides high utility for the streaming data count estimation problem with similar or even lower privacy budgets while providing: a) fuzzy counting to report counts of related or similar items (for instance to account for typing errors and data variations), and b) improved querying efficiency to reduce the response time for real-time querying of counts. We provide formal proofs for privacy and utility guarantees and present extensive experimental evaluation of our algorithm using real and synthetic English words datasets for both the exact and fuzzy counting scenarios. Our privacy preserving mechanism substantially outperforms the prior work in terms of lower querying time, significantly higher utility (accuracy of count estimation) under similar or lower privacy guarantees, at the cost of communication overhead.

Submitted 30 November, 2022; v1 submitted 10 August, 2022; originally announced August 2022.

Comments: Version 2, 14 pages. Accepted in IEEE Transactions on Knowledge and Data Engineering, 2022
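
The abstract above concerns count estimation under local differential privacy. As a rough illustration of that general idea only, the following sketch uses textbook binary randomized response. It is not the mechanism proposed in the paper, nor RAPPOR or Apple's CMS, and the function names `randomize_bit` and `estimate_count` are hypothetical.

```python
import math
import random


def randomize_bit(bit, epsilon, rng):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it.

    Textbook randomized response (an assumption for illustration, not the
    paper's mechanism). Requires epsilon > 0.
    """
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < p_truth else 1 - bit


def estimate_count(reports, epsilon):
    """Unbiased estimate of how many true bits were 1, given noisy reports.

    E[sum(reports)] = k * (2p - 1) + n * (1 - p), solved here for k.
    """
    n = len(reports)
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return (sum(reports) - n * (1.0 - p_truth)) / (2.0 * p_truth - 1.0)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    true_bits = [1] * 6000 + [0] * 14000  # 6000 users hold the item
    reports = [randomize_bit(b, 1.0, rng) for b in true_bits]
    print(round(estimate_count(reports, epsilon=1.0)))
```

Each user perturbs their own bit before it leaves the device, so the collector only ever sees noisy reports; the estimator removes the expected bias in aggregate. A smaller ε gives stronger privacy at the price of a noisier estimate.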

arXiv:2206.15089 [pdf, other]

Fairness and Cost Constrained Privacy-Aware Record Linkage

Authors: Nan Wu, Dinusha Vatsalan, Sunny Verma, Mohamed Ali Kaafar

Abstract: Record linkage algorithms match and link records from different databases that refer to the same real-world entity based on direct and/or quasi-identifiers, such as name, address, age, and gender, available in the records. Since these identifiers generally contain personally identifiable information (PII) about the entities, record linkage algorithms need to be developed with privacy constraints. Many research studies, under the name privacy-preserving record linkage (PPRL), have been conducted to perform the linkage on encoded and/or encrypted identifiers. Differential privacy (DP) combined with computationally efficient encoding methods, e.g. Bloom filter encoding, has been used to develop PPRL with provable privacy guarantees. The standard DP notion does not however address other constraints, among which the most important ones are fairness-bias and cost of linkage in terms of number of record pairs to be compared. In this work, we propose new notions of fairness-constrained DP and fairness and cost-constrained DP for PPRL and develop a framework for PPRL with these new notions of DP combined with Bloom filter encoding. We provide theoretical proofs for the new DP notions for fairness and cost-constrained PPRL and experimentally evaluate them on two datasets containing person-specific data. Our experimental results show that with these new notions of DP, PPRL with better performance (compared to the standard DP notion for PPRL) can be achieved with regard to privacy, cost and fairness constraints.

Submitted 30 June, 2022; originally announced June 2022.

arXiv:2205.06641 [pdf, other]

Privacy Preserving Release of Mobile Sensor Data

Authors: Rahat Masood, Wing Yan Cheng, Dinusha Vatsalan, Deepak Mishra, Hassan Jameel Asghar, Mohamed Ali Kaafar

Abstract: Sensors embedded in mobile smart devices can monitor users' activity with high accuracy to provide a variety of services to end-users ranging from precise geolocation, health monitoring, and handwritten word recognition. However, this involves the risk of accessing and potentially disclosing sensitive information of individuals to the apps that may lead to privacy breaches. In this paper, we aim to minimize privacy leakages that may lead to user identification on mobile devices through user tracking and distinguishability while preserving the functionality of apps and services. We propose a privacy-preserving mechanism that effectively handles the sensor data fluctuations (e.g., inconsistent sensor readings while walking, sitting, and running at different times) by formulating the data as time-series modeling and forecasting. The proposed mechanism also uses the notion of correlated noise-series against noise filtering attacks from an adversary, which aims to filter out the noise from the perturbed data to re-identify the original data. Unlike existing solutions, our mechanism keeps running in isolation without the interaction of a user or a service provider. We perform rigorous experiments on benchmark datasets and show that our proposed mechanism limits user tracking and distinguishability threats to a significant extent compared to the original data while maintaining a reasonable level of utility of functionalities. In general, we show that our obfuscation mechanism reduces the user trackability threat by 60% across all the datasets while maintaining the utility loss below 0.5 Mean Absolute Error (MAE). We also observe that our mechanism is more effective in large datasets. For example, with the Swipes dataset, the distinguishability risk is reduced by 60% on average while the utility loss is below 0.5 MAE.

Submitted 13 May, 2022; originally announced May 2022.

Comments: 12 pages, 10 figures, 1 table

arXiv:2002.06856 [pdf, other]

Data and Model Dependencies of Membership Inference Attack

Authors: Shakila Mahjabin Tonni, Dinusha Vatsalan, Farhad Farokhi, Dali Kaafar, Zhigang Lu, Gioacchino Tangari

Abstract: Machine learning (ML) models have been shown to be vulnerable to Membership Inference Attacks (MIA), which infer the membership of a given data point in the target dataset by observing the prediction output of the ML model. While the key factors for the success of MIA have not yet been fully understood, existing defense mechanisms such as using L2 regularization \cite{10shokri2017membership} and dropout layers \cite{salem2018ml} take only the model's overfitting property into consideration. In this paper, we provide an empirical analysis of the impact of both the data and ML model properties on the vulnerability of ML techniques to MIA. Our results reveal the relationship between MIA accuracy and properties of the dataset and training model in use. In particular, we show that the size of shadow dataset, the class and feature balance and the entropy of the target dataset, the configurations and fairness of the training model are the most influential factors. Based on those experimental findings, we conclude that along with model overfitting, multiple properties jointly contribute to MIA success instead of any single property. Building on our experimental findings, we propose using those data and model properties as regularizers to protect ML models against MIA. Our results show that the proposed defense mechanisms can reduce the MIA accuracy by up to 25% without sacrificing the ML model prediction utility.

Submitted 25 July, 2020; v1 submitted 17 February, 2020; originally announced February 2020.

arXiv:1911.12930 [pdf, ps, other]

Incremental Clustering Techniques for Multi-Party Privacy-Preserving Record Linkage

Authors: Dinusha Vatsalan, Peter Christen, Erhard Rahm

Abstract: Privacy-Preserving Record Linkage (PPRL) supports the integration of sensitive information from multiple datasets, in particular the privacy-preserving matching of records referring to the same entity. PPRL has gained much attention in many application areas, with the most prominent ones in the healthcare domain. PPRL techniques tackle this problem by conducting linkage on masked (encoded) values. Employing PPRL on records from multiple (more than two) parties/sources (multi-party PPRL, MP-PPRL) is an increasingly important but challenging problem that so far has not been sufficiently solved. Existing MP-PPRL approaches are limited to finding only those entities that are present in all parties thereby missing entities that match only in a subset of parties. Furthermore, previous MP-PPRL approaches face substantial scalability limitations due to the need of a large number of comparisons between masked records. We thus propose and evaluate new MP-PPRL approaches that find matches in any subset of parties and still scale to many parties. Our approaches maintain all matches within clusters, where these clusters are incrementally extended or refined by considering records from one party after the other. An empirical evaluation using multiple real datasets ranging from 3 to 26 parties each containing up to 5 million records validates that our protocols are efficient, and significantly outperform existing MP-PPRL approaches in terms of linkage quality and scalability.

Submitted 28 November, 2019; originally announced November 2019.

arXiv:1701.01232 [pdf, ps, other]

Scalable Multi-Database Privacy-Preserving Record Linkage using Counting Bloom Filters

Authors: Dinusha Vatsalan, Peter Christen, Erhard Rahm

Abstract: Privacy-preserving record linkage (PPRL) aims at integrating sensitive information from multiple disparate databases of different organizations. PPRL approaches are increasingly required in real-world application areas such as healthcare, national security, and business. Previous approaches have mostly focused on linking only two databases as well as the use of a dedicated linkage unit. Scaling PPRL to more databases (multi-party PPRL) is an open challenge since privacy threats as well as the computation and communication costs for record linkage increase significantly with the number of databases. We thus propose the use of a new encoding method of sensitive data based on Counting Bloom Filters (CBF) to improve privacy for multi-party PPRL. We also investigate optimizations to reduce communication and computation costs for CBF-based multi-party PPRL with and without the use of a dedicated linkage unit. Empirical evaluations conducted with real datasets show the viability of the proposed approaches and demonstrate their scalability, linkage quality, and privacy protection.

Submitted 5 January, 2017; originally announced January 2017.

Comments: This is an extended version of an article published in IEEE ICDM International Workshop on Privacy and Discrimination in Data Mining (PDDM) 2016 - Scalable privacy-preserving linking of multiple databases using counting Bloom filters
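
To make the idea of a counting Bloom filter shared across parties more concrete, here is a minimal single-process simulation. It assumes, as one plausible reading of the abstract rather than the paper's exact protocol, that the CBF is the element-wise sum of the parties' Bloom filter bit vectors and that the sum is obtained with a masked, ring-style secure summation. The names `secure_sum_cbf` and `count_common_positions` are hypothetical.

```python
import secrets


def secure_sum_cbf(bit_vectors, modulus=2**32):
    """Simulate a ring-based secure summation of the parties' Bloom filters.

    Assumption for illustration: the first party starts from a random mask,
    every party adds its own bit vector to the running total, and the first
    party finally removes the mask. Intermediate totals therefore look random.
    """
    length = len(bit_vectors[0])
    mask = [secrets.randbelow(modulus) for _ in range(length)]
    running = list(mask)
    for bits in bit_vectors:  # each party adds its bits to the masked total
        running = [(r + b) % modulus for r, b in zip(running, bits)]
    # Removing the mask reveals only the per-position counts (the CBF).
    return [(r - m) % modulus for r, m in zip(running, mask)]


def count_common_positions(cbf, num_parties):
    """Positions whose count equals the number of parties are set in every filter."""
    return sum(1 for c in cbf if c == num_parties)


if __name__ == "__main__":
    parties = [[1, 0, 1, 1], [1, 1, 0, 1], [1, 0, 0, 1]]
    cbf = secure_sum_cbf(parties)
    print(cbf, count_common_positions(cbf, len(parties)))
```

The resulting counts show how many parties set each position without showing which party set it, which is the intuition behind using a CBF instead of exchanging the individual bit vectors.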

arXiv:1612.08835 [pdf, ps, other]

Multi-Party Privacy-Preserving Record Linkage using Bloom Filters

Authors: Dinusha Vatsalan, Peter Christen

Abstract: Privacy-preserving record linkage (PPRL), the problem of identifying records that correspond to the same real-world entity across several data sources held by different parties without revealing any sensitive information about these records, is increasingly being required in many real-world application areas. Examples range from public health surveillance to crime and fraud detection, and national security. Various techniques have been developed to tackle the problem of PPRL, with the majority of them considering linking data from only two sources. However, in many real-world applications data from more than two sources need to be linked. In this paper we propose a viable solution for multi-party PPRL using two efficient privacy techniques: Bloom filter encoding and distributed secure summation. Our proposed protocol efficiently identifies matching sets of records held by all data sources that have a similarity above a certain minimum threshold. While being efficient, our protocol is also secure under the semi-honest adversary model in that no party can learn any sensitive information about any other parties' data, but all parties learn which of their records have a high similarity with records held by the other parties. We evaluate our protocol on a large real voter registration database showing the scalability, linkage quality, and privacy of our approach.

Submitted 28 December, 2016; originally announced December 2016.

Comments: Extended version of the poster paper published in proceedings of ACM Conference in Information and Knowledge Management (CIKM) 2014 (http://dl.acm.org/citation.cfm?id=2661875)
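
As a rough sketch of the Bloom filter encoding mentioned in the abstract above, the code below hashes the character bigrams of a string into bit positions and compares the encodings of several parties with a Dice coefficient generalised to more than two filters. The parameter values, the salted SHA-256 hashing, and the names `qgrams`, `bloom_encode` and `dice_similarity` are assumptions made for illustration, not details taken from the paper.

```python
import hashlib


def qgrams(value, q=2):
    """Split a string into its set of overlapping q-grams (bigrams by default)."""
    s = value.lower().strip()
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def bloom_encode(value, num_bits=64, num_hashes=2, q=2):
    """Return the set of bit positions set to 1 for `value`.

    Each q-gram is hashed `num_hashes` times (salted SHA-256, an illustrative
    choice) and every hash selects one of `num_bits` positions.
    """
    positions = set()
    for gram in qgrams(value, q):
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{gram}".encode()).hexdigest()
            positions.add(int(digest, 16) % num_bits)
    return positions


def dice_similarity(filters):
    """Dice coefficient over p filters: p * |common positions| / sum of |filter_i|."""
    common = set.intersection(*filters)
    total = sum(len(bf) for bf in filters)
    return len(filters) * len(common) / total if total else 0.0


if __name__ == "__main__":
    encodings = [bloom_encode(name) for name in ("smith", "smyth", "smith")]
    print(dice_similarity(encodings))
```

Because similar strings share most of their bigrams, their encodings share most of their set positions, so approximate matching can be done on the encodings without exchanging the names themselves. In a multi-party setting the count of common positions would be obtained through a secure protocol rather than by pooling the filters in one place as this sketch does.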

Showing 1–11 of 11 results for author: Vatsalan, D