Search | arXiv e-print repository

The Privacy-Utility Trade-off in the Topics API

Authors: Mário S. Alvim, Natasha Fernandes, Annabelle McIver, Gabriel H. Nunes

Abstract: The ongoing deprecation of third-party cookies by web browser vendors has sparked the proposal of alternative methods to support more privacy-preserving personalized advertising on web browsers and applications. The Topics API is being proposed by Google to provide third-parties with "coarse-grained advertising topics that the page visitor might currently be interested in". In this paper, we analy… ▽ More The ongoing deprecation of third-party cookies by web browser vendors has sparked the proposal of alternative methods to support more privacy-preserving personalized advertising on web browsers and applications. The Topics API is being proposed by Google to provide third-parties with "coarse-grained advertising topics that the page visitor might currently be interested in". In this paper, we analyze the re-identification risks for individual Internet users and the utility provided to advertising companies by the Topics API, i.e. learning the most popular topics and distinguishing between real and random topics. We provide theoretical results dependent only on the API parameters that can be readily applied to evaluate the privacy and utility implications of future API updates, including novel general upper-bounds that account for adversaries with access to unknown, arbitrary side information, the value of the differential privacy parameter $ε$, and experimental results on real-world data that validate our theoretical model. △ Less

Submitted 21 June, 2024; originally announced June 2024.

Comments: CCS '24 (to appear)

arXiv:2406.13569 [pdf, other]

Bayes' capacity as a measure for reconstruction attacks in federated learning

Authors: Sayan Biswas, Mark Dras, Pedro Faustini, Natasha Fernandes, Annabelle McIver, Catuscia Palamidessi, Parastoo Sadeghi

Abstract: Within the machine learning community, reconstruction attacks are a principal attack of concern and have been identified even in federated learning, which was designed with privacy preservation in mind. In federated learning, it has been shown that an adversary with knowledge of the machine learning architecture is able to infer the exact value of a training element given an observation of the wei… ▽ More Within the machine learning community, reconstruction attacks are a principal attack of concern and have been identified even in federated learning, which was designed with privacy preservation in mind. In federated learning, it has been shown that an adversary with knowledge of the machine learning architecture is able to infer the exact value of a training element given an observation of the weight updates performed during stochastic gradient descent. In response to these threats, the privacy community recommends the use of differential privacy in the stochastic gradient descent algorithm, termed DP-SGD. However, DP has not yet been formally established as an effective countermeasure against reconstruction attacks. In this paper, we formalise the reconstruction threat model using the information-theoretic framework of quantitative information flow. We show that the Bayes' capacity, related to the Sibson mutual information of order infinity, represents a tight upper bound on the leakage of the DP-SGD algorithm to an adversary interested in performing a reconstruction attack. We provide empirical results demonstrating the effectiveness of this measure for comparing mechanisms against reconstruction threats. △ Less

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2402.07281 [pdf, other]

Can Tree Based Approaches Surpass Deep Learning in Anomaly Detection? A Benchmarking Study

Authors: Santonu Sarkar, Shanay Mehta, Nicole Fernandes, Jyotirmoy Sarkar, Snehanshu Saha

Abstract: Detection of anomalous situations for complex mission-critical systems holds paramount importance when their service continuity needs to be ensured. A major challenge in detecting anomalies from the operational data arises due to the imbalanced class distribution problem since the anomalies are supposed to be rare events. This paper evaluates a diverse array of machine learning-based anomaly detec… ▽ More Detection of anomalous situations for complex mission-critical systems holds paramount importance when their service continuity needs to be ensured. A major challenge in detecting anomalies from the operational data arises due to the imbalanced class distribution problem since the anomalies are supposed to be rare events. This paper evaluates a diverse array of machine learning-based anomaly detection algorithms through a comprehensive benchmark study. The paper contributes significantly by conducting an unbiased comparison of various anomaly detection algorithms, spanning classical machine learning including various tree-based approaches to deep learning and outlier detection methods. The inclusion of 104 publicly available and a few proprietary industrial systems datasets enhances the diversity of the study, allowing for a more realistic evaluation of algorithm performance and emphasizing the importance of adaptability to real-world scenarios. The paper dispels the deep learning myth, demonstrating that though powerful, deep learning is not a universal solution in this case. We observed that recently proposed tree-based evolutionary algorithms outperform in many scenarios. We noticed that tree-based approaches catch a singleton anomaly in a dataset where deep learning methods fail. On the other hand, classical SVM performs the best on datasets with more than 10% anomalies, implying that such scenarios can be best modeled as a classification problem rather than anomaly detection. To our knowledge, such a study on a large number of state-of-the-art algorithms using diverse data sets, with the objective of guiding researchers and practitioners in making informed algorithmic choices, has not been attempted earlier. △ Less

Submitted 25 February, 2024; v1 submitted 11 February, 2024; originally announced February 2024.

arXiv:2309.14746 [pdf, other]

A Quantitative Information Flow Analysis of the Topics API

Authors: Mário S. Alvim, Natasha Fernandes, Annabelle McIver, Gabriel H. Nunes

Abstract: Third-party cookies have been a privacy concern since cookies were first developed in the mid 1990s, but more strict cookie policies were only introduced by Internet browser vendors in the early 2010s. More recently, due to regulatory changes, browser vendors have started to completely block third-party cookies, with both Firefox and Safari already compliant. The Topics API is being proposed by… ▽ More Third-party cookies have been a privacy concern since cookies were first developed in the mid 1990s, but more strict cookie policies were only introduced by Internet browser vendors in the early 2010s. More recently, due to regulatory changes, browser vendors have started to completely block third-party cookies, with both Firefox and Safari already compliant. The Topics API is being proposed by Google as an additional and less intrusive source of information for interest-based advertising (IBA), following the upcoming deprecation of third-party cookies. Initial results published by Google estimate the probability of a correct re-identification of a random individual would be below 3% while still supporting IBA. In this paper, we analyze the re-identification risk for individual Internet users introduced by the Topics API from the perspective of Quantitative Information Flow (QIF), an information- and decision-theoretic framework. Our model allows a theoretical analysis of both privacy and utility aspects of the API and their trade-off, and we show that the Topics API does have better privacy than third-party cookies. We leave the utility analyses for future work. △ Less

Submitted 26 September, 2023; originally announced September 2023.

Comments: WPES '23 (to appear)

arXiv:2308.11110 [pdf, other]

A novel analysis of utility in privacy pipelines, using Kronecker products and quantitative information flow

Authors: Mário S. Alvim, Natasha Fernandes, Annabelle McIver, Carroll Morgan, Gabriel H. Nunes

Abstract: We combine Kronecker products, and quantitative information flow, to give a novel formal analysis for the fine-grained verification of utility in complex privacy pipelines. The combination explains a surprising anomaly in the behaviour of utility of privacy-preserving pipelines -- that sometimes a reduction in privacy results also in a decrease in utility. We use the standard measure of utility fo… ▽ More We combine Kronecker products, and quantitative information flow, to give a novel formal analysis for the fine-grained verification of utility in complex privacy pipelines. The combination explains a surprising anomaly in the behaviour of utility of privacy-preserving pipelines -- that sometimes a reduction in privacy results also in a decrease in utility. We use the standard measure of utility for Bayesian analysis, introduced by Ghosh at al., to produce tractable and rigorous proofs of the fine-grained statistical behaviour leading to the anomaly. More generally, we offer the prospect of formal-analysis tools for utility that complement extant formal analyses of privacy. We demonstrate our results on a number of common privacy-preserving designs. △ Less

Submitted 7 November, 2023; v1 submitted 21 August, 2023; originally announced August 2023.

arXiv:2211.04686 [pdf, other]

Directional Privacy for Deep Learning

Authors: Pedro Faustini, Natasha Fernandes, Shakila Tonni, Annabelle McIver, Mark Dras

Abstract: Differentially Private Stochastic Gradient Descent (DP-SGD) is a key method for applying privacy in the training of deep learning models. It applies isotropic Gaussian noise to gradients during training, which can perturb these gradients in any direction, damaging utility. Metric DP, however, can provide alternative mechanisms based on arbitrary metrics that might be more suitable for preserving u… ▽ More Differentially Private Stochastic Gradient Descent (DP-SGD) is a key method for applying privacy in the training of deep learning models. It applies isotropic Gaussian noise to gradients during training, which can perturb these gradients in any direction, damaging utility. Metric DP, however, can provide alternative mechanisms based on arbitrary metrics that might be more suitable for preserving utility. In this paper, we apply \textit{directional privacy}, via a mechanism based on the von Mises-Fisher (VMF) distribution, to perturb gradients in terms of \textit{angular distance} so that gradient direction is broadly preserved. We show that this provides both $ε$-DP and $εd$-privacy for deep learning training, rather than the $(ε, δ)$-privacy of the Gaussian mechanism. Experiments on key datasets then indicate that the VMF mechanism can outperform the Gaussian in the utility-privacy trade-off. In particular, our experiments provide a direct empirical comparison of privacy between the two approaches in terms of their ability to defend against reconstruction and membership inference. △ Less

Submitted 26 November, 2023; v1 submitted 9 November, 2022; originally announced November 2022.

arXiv:2210.12916 [pdf, ps, other]

Explaining epsilon in local differential privacy through the lens of quantitative information flow

Authors: Natasha Fernandes, Annabelle McIver, Parastoo Sadeghi

Abstract: The study of leakage measures for privacy has been a subject of intensive research and is an important aspect of understanding how privacy leaks occur in computer systems. Differential privacy has been a focal point in the privacy community for some years and yet its leakage characteristics are not completely understood. In this paper we bring together two areas of research -- information theory a… ▽ More The study of leakage measures for privacy has been a subject of intensive research and is an important aspect of understanding how privacy leaks occur in computer systems. Differential privacy has been a focal point in the privacy community for some years and yet its leakage characteristics are not completely understood. In this paper we bring together two areas of research -- information theory and the g-leakage framework of quantitative information flow (QIF) -- to give an operational interpretation for the epsilon parameter of local differential privacy. We find that epsilon emerges as a capacity measure in both frameworks; via (log)-lift, a popular measure in information theory; and via max-case g-leakage, which we introduce to describe the leakage of any system to Bayesian adversaries modelled using ``worst-case'' assumptions under the QIF framework. Our characterisation resolves an important question of interpretability of epsilon and consolidates a number of disparate results covering the literature of both information theory and quantitative information flow. △ Less

Submitted 18 May, 2023; v1 submitted 23 October, 2022; originally announced October 2022.

arXiv:2206.06493 [pdf, other]

A novel reconstruction attack on foreign-trade official statistics, with a Brazilian case study

Authors: Danilo Fabrino Favato, Gabriel Coutinho, Mário S. Alvim, Natasha Fernandes

Abstract: In this paper we describe, formalize, implement, and experimentally evaluate a novel transaction re-identification attack against official foreign-trade statistics releases in Brazil. The attack's goal is to re-identify the importers of foreign-trade transactions (by revealing the identity of the company performing that transaction), which consequently violates those importers' fiscal secrecy (by… ▽ More In this paper we describe, formalize, implement, and experimentally evaluate a novel transaction re-identification attack against official foreign-trade statistics releases in Brazil. The attack's goal is to re-identify the importers of foreign-trade transactions (by revealing the identity of the company performing that transaction), which consequently violates those importers' fiscal secrecy (by revealing sensitive information: the value and volume of traded goods). We provide a mathematical formalization of this fiscal secrecy problem using principles from the framework of quantitative information flow (QIF), then carefully identify the main sources of imprecision in the official data releases used as auxiliary information in the attack, and model transaction re-construction as a linear optimization problem solvable through integer linear programming (ILP). We show that this problem is NP-complete, and provide a methodology to identify tractable instances. We exemplify the feasibility of our attack by performing 2,003 transaction re-identifications that in total amount to more than \$137M, and affect 348 Brazilian companies. Further, since similar statistics are produced by other statistical agencies, our attack is of broader concern. △ Less

Submitted 13 June, 2022; originally announced June 2022.

Comments: 35 pages

arXiv:2205.11519 [pdf, other]

FedSA: Accelerating Intrusion Detection in Collaborative Environments with Federated Simulated Annealing

Authors: Helio N. Cunha Neto, Ivana Dusparic, Diogo M. F. Mattos, Natalia C. Fernandes

Abstract: Fast identification of new network attack patterns is crucial for improving network security. Nevertheless, identifying an ongoing attack in a heterogeneous network is a non-trivial task. Federated learning emerges as a solution to collaborative training for an Intrusion Detection System (IDS). The federated learning-based IDS trains a global model using local machine learning models provided by f… ▽ More Fast identification of new network attack patterns is crucial for improving network security. Nevertheless, identifying an ongoing attack in a heterogeneous network is a non-trivial task. Federated learning emerges as a solution to collaborative training for an Intrusion Detection System (IDS). The federated learning-based IDS trains a global model using local machine learning models provided by federated participants without sharing local data. However, optimization challenges are intrinsic to federated learning. This paper proposes the Federated Simulated Annealing (FedSA) metaheuristic to select the hyperparameters and a subset of participants for each aggregation round in federated learning. FedSA optimizes hyperparameters linked to the global model convergence. The proposal reduces aggregation rounds and speeds up convergence. Thus, FedSA accelerates learning extraction from local models, requiring fewer IDS updates. The proposal assessment shows that the FedSA global model converges in less than ten communication rounds. The proposal requires up to 50% fewer aggregation rounds to achieve approximately 97% accuracy in attack detection than the conventional aggregation approach. △ Less

Submitted 23 May, 2022; originally announced May 2022.

arXiv:2205.01258 [pdf, other]

Universal Optimality and Robust Utility Bounds for Metric Differential Privacy

Authors: Natasha Fernandes, Annabelle McIver, Catuscia Palamidessi, Ming Ding

Abstract: We study the privacy-utility trade-off in the context of metric differential privacy. Ghosh et al. introduced the idea of universal optimality to characterise the best mechanism for a certain query that simultaneously satisfies (a fixed) $ε$-differential privacy constraint whilst at the same time providing better utility compared to any other $ε$-differentially private mechanism for the same query… ▽ More We study the privacy-utility trade-off in the context of metric differential privacy. Ghosh et al. introduced the idea of universal optimality to characterise the best mechanism for a certain query that simultaneously satisfies (a fixed) $ε$-differential privacy constraint whilst at the same time providing better utility compared to any other $ε$-differentially private mechanism for the same query. They showed that the Geometric mechanism is "universally optimal" for the class of counting queries. On the other hand, Brenner and Nissim showed that outside the space of counting queries, and for the Bayes risk loss function, no such universally optimal mechanisms exist. In this paper we use metric differential privacy and quantitative information flow as the fundamental principle for studying universal optimality. Metric differential privacy is a generalisation of both standard (i.e., central) differential privacy and local differential privacy, and it is increasingly being used in various application domains, for instance in location privacy and in privacy preserving machine learning. Using this framework we are able to clarify Nissim and Brenner's negative results, showing (a) that in fact all privacy types contain optimal mechanisms relative to certain kinds of non-trivial loss functions, and (b) extending and generalising their negative results beyond Bayes risk specifically to a wide class of non-trivial loss functions. We also propose weaker universal benchmarks of utility called "privacy type capacities". We show that such capacities always exist and can be computed using a convex optimisation algorithm. △ Less

Submitted 2 May, 2022; originally announced May 2022.

arXiv:2204.13734 [pdf, other]

doi 10.56553/popets-2022-0114

doi 10.5281/zenodo.6533704

Flexible and scalable privacy assessment for very large datasets, with an application to official governmental microdata

Authors: Mário S. Alvim, Natasha Fernandes, Annabelle McIver, Carroll Morgan, Gabriel H. Nunes

Abstract: We present a systematic refactoring of the conventional treatment of privacy analyses, basing it on mathematical concepts from the framework of Quantitative Information Flow (QIF). The approach we suggest brings three principal advantages: it is flexible, allowing for precise quantification and comparison of privacy risks for attacks both known and novel; it can be computationally tractable for ve… ▽ More We present a systematic refactoring of the conventional treatment of privacy analyses, basing it on mathematical concepts from the framework of Quantitative Information Flow (QIF). The approach we suggest brings three principal advantages: it is flexible, allowing for precise quantification and comparison of privacy risks for attacks both known and novel; it can be computationally tractable for very large, longitudinal datasets; and its results are explainable both to politicians and to the general public. We apply our approach to a very large case study: the Educational Censuses of Brazil, curated by the governmental agency INEP, which comprise over 90 attributes of approximately 50 million individuals released longitudinally every year since 2007. These datasets have only very recently (2018-2021) attracted legislation to regulate their privacy -- while at the same time continuing to maintain the openness that had been sought in Brazilian society. INEP's reaction to that legislation was the genesis of our project with them. In our conclusions here we share the scientific, technical, and communication lessons we learned in the process. △ Less

Submitted 25 July, 2022; v1 submitted 28 April, 2022; originally announced April 2022.

Journal ref: PoPETs 2022.4 (2022) 378-399

arXiv:2105.07176 [pdf, other]

The Laplace Mechanism has optimal utility for differential privacy over continuous queries

Authors: Natasha Fernandes, Annabelle McIver, Carroll Morgan

Abstract: Differential Privacy protects individuals' data when statistical queries are published from aggregated databases: applying "obfuscating" mechanisms to the query results makes the released information less specific but, unavoidably, also decreases its utility. Yet it has been shown that for discrete data (e.g. counting queries), a mandated degree of privacy and a reasonable interpretation of loss o… ▽ More Differential Privacy protects individuals' data when statistical queries are published from aggregated databases: applying "obfuscating" mechanisms to the query results makes the released information less specific but, unavoidably, also decreases its utility. Yet it has been shown that for discrete data (e.g. counting queries), a mandated degree of privacy and a reasonable interpretation of loss of utility, the Geometric obfuscating mechanism is optimal: it loses as little utility as possible. For continuous query results however (e.g. real numbers) the optimality result does not hold. Our contribution here is to show that optimality is regained by using the Laplace mechanism for the obfuscation. The technical apparatus involved includes the earlier discrete result by Ghosh et al., recent work on abstract channels and their geometric representation as hyper-distributions, and the dual interpretations of distance between distributions provided by the Kantorovich-Rubinstein Theorem. △ Less

Submitted 26 July, 2021; v1 submitted 15 May, 2021; originally announced May 2021.

arXiv:2011.08127 [pdf, other]

The Influence of Domain-Based Preprocessing on Subject-Specific Clustering

Authors: Alexandra Gkolia, Nikhil Fernandes, Nicolas Pizzo, James Davenport, Akshar Nair

Abstract: The sudden change of moving the majority of teaching online at Universities due to the global Covid-19 pandemic has caused an increased amount of workload for academics. One of the contributing factors is answering a high volume of queries coming from students. As these queries are not limited to the synchronous time frame of a lecture, there is a high chance of many of them being related or even… ▽ More The sudden change of moving the majority of teaching online at Universities due to the global Covid-19 pandemic has caused an increased amount of workload for academics. One of the contributing factors is answering a high volume of queries coming from students. As these queries are not limited to the synchronous time frame of a lecture, there is a high chance of many of them being related or even equivalent. One way to deal with this problem is to cluster these questions depending on their topic. In our previous work, we aimed to find an improved method of clustering that would give us a high efficiency, using a recurring LDA model. Our data set contained questions posted online from a Computer Science course at the University of Bath. A significant number of these questions contained code excerpts, which we found caused a problem in clustering, as certain terms were being considered as common words in the English language and not being recognised as specific code terms. To address this, we implemented tagging of these technical terms using Python, as part of preprocessing the data set. In this paper, we explore the realms of tagging data sets, focusing on identifying code excerpts and providing empirical results in order to justify our reasoning. △ Less

Submitted 16 November, 2020; originally announced November 2020.

Comments: 8 pages, 5 figures

arXiv:2011.01035 [pdf, other]

Unification of HDP and LDA Models for Optimal Topic Clustering of Subject Specific Question Banks

Authors: Nikhil Fernandes, Alexandra Gkolia, Nicolas Pizzo, James Davenport, Akshar Nair

Abstract: There has been an increasingly popular trend in Universities for curriculum transformation to make teaching more interactive and suitable for online courses. An increase in the popularity of online courses would result in an increase in the number of course-related queries for academics. This, coupled with the fact that if lectures were delivered in a video on demand format, there would be no fixe… ▽ More There has been an increasingly popular trend in Universities for curriculum transformation to make teaching more interactive and suitable for online courses. An increase in the popularity of online courses would result in an increase in the number of course-related queries for academics. This, coupled with the fact that if lectures were delivered in a video on demand format, there would be no fixed time where the majority of students could ask questions. When questions are asked in a lecture there is a negligible chance of having similar questions repeatedly, but asynchronously this is more likely. In order to reduce the time spent on answering each individual question, clustering them is an ideal choice. There are different unsupervised models fit for text clustering, of which the Latent Dirichlet Allocation model is the most commonly used. We use the Hierarchical Dirichlet Process to determine an optimal topic number input for our LDA model runs. Due to the probabilistic nature of these topic models, the outputs of them vary for different runs. The general trend we found is that not all the topics were being used for clustering on the first run of the LDA model, which results in a less effective clustering. To tackle probabilistic output, we recursively use the LDA model on the effective topics being used until we obtain an efficiency ratio of 1. Through our experimental results we also establish a reasoning on how Zeno's paradox is avoided. △ Less

Submitted 4 October, 2020; originally announced November 2020.

Comments: 8 pages, 5 figures, Submitted to EAAI21

arXiv:2010.09393 [pdf, other]

doi 10.1007/978-3-030-88428-4_28

Locality Sensitive Hashing with Extended Differential Privacy

Authors: Natasha Fernandes, Yusuke Kawamoto, Takao Murakami

Abstract: Extended differential privacy, a generalization of standard differential privacy (DP) using a general metric, has been widely studied to provide rigorous privacy guarantees while kee** high utility. However, existing works on extended DP are limited to few metrics, such as the Euclidean metric. Consequently, they have only a small number of applications, such as location-based services and docum… ▽ More Extended differential privacy, a generalization of standard differential privacy (DP) using a general metric, has been widely studied to provide rigorous privacy guarantees while kee** high utility. However, existing works on extended DP are limited to few metrics, such as the Euclidean metric. Consequently, they have only a small number of applications, such as location-based services and document processing. In this paper, we propose a couple of mechanisms providing extended DP with a different metric: angular distance (or cosine distance). Our mechanisms are based on locality sensitive hashing (LSH), which can be applied to the angular distance and work well for personal data in a high-dimensional space. We theoretically analyze the privacy properties of our mechanisms, and prove extended DP for input data by taking into account that LSH preserves the original metric only approximately. We apply our mechanisms to friend matching based on high-dimensional personal data with angular distance in the local model, and evaluate our mechanisms using two real datasets. We show that LDP requires a very large privacy budget and that RAPPOR does not work in this application. Then we show that our mechanisms enable friend matching with high utility and rigorous privacy guarantees based on extended DP. △ Less

Submitted 12 August, 2021; v1 submitted 19 October, 2020; originally announced October 2020.

Comments: ESORICS 2021 (the 26th European Symposium on Research in Computer Security)

Journal ref: Proceedings of the 26th European Symposium on Research in Computer Security (ESORICS 2021), Part II, Lecture Notes in Computer Science Vol. 12973, pp.563-583, 2021

arXiv:1906.12147 [pdf, other]

Utility-Preserving Privacy Mechanisms for Counting Queries

Authors: Natasha Fernandes, Kacem Lefki, Catuscia Palamidessi

Abstract: Differential privacy (DP) and local differential privacy (LPD) are frameworks to protect sensitive information in data collections. They are both based on obfuscation. In DP the noise is added to the result of queries on the dataset, whereas in LPD the noise is added directly on the individual records, before being collected. The main advantage of LPD with respect to DP is that it does not need to… ▽ More Differential privacy (DP) and local differential privacy (LPD) are frameworks to protect sensitive information in data collections. They are both based on obfuscation. In DP the noise is added to the result of queries on the dataset, whereas in LPD the noise is added directly on the individual records, before being collected. The main advantage of LPD with respect to DP is that it does not need to assume a trusted third party. The main disadvantage is that the trade-off between privacy and utility is usually worse than in DP, and typically to retrieve reasonably good statistics from the locally sanitized data it is necessary to have a huge collection of them. In this paper, we focus on the problem of estimating counting queries from collections of noisy answers, and we propose a variant of LDP based on the addition of geometric noise. Our main result is that the geometric noise has a better statistical utility than other LPD mechanisms from the literature. △ Less

Submitted 28 June, 2019; originally announced June 2019.

Journal ref: Models, Languages and Tools for Concurrent and Distributed Programming, LNCS 11665, Springer, 2019

arXiv:1811.10256 [pdf, other]

Generalised Differential Privacy for Text Document Processing

Authors: Natasha Fernandes, Mark Dras, Annabelle McIver

Abstract: We address the problem of how to "obfuscate" texts by removing stylistic clues which can identify authorship, whilst preserving (as much as possible) the content of the text. In this paper we combine ideas from "generalised differential privacy" and machine learning techniques for text processing to model privacy for text documents. We define a privacy mechanism that operates at the level of text… ▽ More We address the problem of how to "obfuscate" texts by removing stylistic clues which can identify authorship, whilst preserving (as much as possible) the content of the text. In this paper we combine ideas from "generalised differential privacy" and machine learning techniques for text processing to model privacy for text documents. We define a privacy mechanism that operates at the level of text documents represented as "bags-of-words" - these representations are typical in machine learning and contain sufficient information to carry out many kinds of classification tasks including topic identification and authorship attribution (of the original documents). We show that our mechanism satisfies privacy with respect to a metric for semantic similarity, thereby providing a balance between utility, defined by the semantic content of texts, with the obfuscation of stylistic clues. We demonstrate our implementation on a "fan fiction" dataset, confirming that it is indeed possible to disguise writing style effectively whilst preserving enough information and variation for accurate content classification tasks. △ Less

Submitted 5 February, 2019; v1 submitted 26 November, 2018; originally announced November 2018.

Comments: Typos corrected

arXiv:1805.08866 [pdf, ps, other]

Author Obfuscation Using Generalised Differential Privacy

Authors: Natasha Fernandes, Mark Dras, Annabelle McIver

Abstract: The problem of obfuscating the authorship of a text document has received little attention in the literature to date. Current approaches are ad-hoc and rely on assumptions about an adversary's auxiliary knowledge which makes it difficult to reason about the privacy properties of these methods. Differential privacy is a well-known and robust privacy approach, but its reliance on the notion of adjac… ▽ More The problem of obfuscating the authorship of a text document has received little attention in the literature to date. Current approaches are ad-hoc and rely on assumptions about an adversary's auxiliary knowledge which makes it difficult to reason about the privacy properties of these methods. Differential privacy is a well-known and robust privacy approach, but its reliance on the notion of adjacency between datasets has prevented its application to text document privacy. However, generalised differential privacy permits the application of differential privacy to arbitrary datasets endowed with a metric and has been demonstrated on problems involving the release of individual data points. In this paper we show how to apply generalised differential privacy to author obfuscation by utilising existing tools and methods from the stylometry and natural language processing literature. △ Less

Submitted 22 May, 2018; originally announced May 2018.

arXiv:1609.00878 [pdf, ps, other]

A Probabilistic Optimum-Path Forest Classifier for Binary Classification Problems

Authors: Silas E. N. Fernandes, Danillo R. Pereira, Caio C. O. Ramos, Andre N. Souza, Joao P. Papa

Abstract: Probabilistic-driven classification techniques extend the role of traditional approaches that output labels (usually integer numbers) only. Such techniques are more fruitful when dealing with problems where one is not interested in recognition/identification only, but also into monitoring the behavior of consumers and/or machines, for instance. Therefore, by means of probability estimates, one can… ▽ More Probabilistic-driven classification techniques extend the role of traditional approaches that output labels (usually integer numbers) only. Such techniques are more fruitful when dealing with problems where one is not interested in recognition/identification only, but also into monitoring the behavior of consumers and/or machines, for instance. Therefore, by means of probability estimates, one can take decisions to work better in a number of scenarios. In this paper, we propose a probabilistic-based Optimum Path Forest (OPF) classifier to handle with binary classification problems, and we show it can be more accurate than naive OPF in a number of datasets. In addition to being just more accurate or not, probabilistic OPF turns to be another useful tool to the scientific community. △ Less

Submitted 3 September, 2016; originally announced September 2016.

Comments: Submitted to Neural Processing Letters

Showing 1–19 of 19 results for author: Fernandes, N