Search | arXiv e-print repository

Adaptive Data Analysis for Growing Data

Authors: Neil G. Marchant, Benjamin I. P. Rubinstein

Abstract: Reuse of data in adaptive workflows poses challenges regarding overfitting and the statistical validity of results. Previous work has demonstrated that interacting with data via differentially private algorithms can mitigate overfitting, achieving worst-case generalization guarantees with asymptotically optimal data requirements. However, such past work assumes data is static and cannot accommodate situations where data grows over time. In this paper we address this gap, presenting the first generalization bounds for adaptive analysis in the dynamic data setting. We allow the analyst to adaptively schedule their queries conditioned on the current size of the data, in addition to previous queries and responses. We also incorporate time-varying empirical accuracy bounds and mechanisms, allowing for tighter guarantees as data accumulates. In a batched query setting, the asymptotic data requirements of our bound grow with the square-root of the number of adaptive queries, matching prior works' improvement over data splitting for the static setting. We instantiate our bound for statistical queries with the clipped Gaussian mechanism, where it empirically outperforms baselines composed from static bounds.

Submitted 22 May, 2024; originally announced May 2024.

arXiv:2405.11575 [pdf, other]

SEEP: Training Dynamics Grounds Latent Representation Search for Mitigating Backdoor Poisoning Attacks

Authors: Xuanli He, Qiongkai Xu, Jun Wang, Benjamin I. P. Rubinstein, Trevor Cohn

Abstract: Modern NLP models are often trained on public datasets drawn from diverse sources, rendering them vulnerable to data poisoning attacks. These attacks can manipulate the model's behavior in ways engineered by the attacker. One such tactic involves the implantation of backdoors, achieved by poisoning specific training instances with a textual trigger and a target class label. Several strategies have been proposed to mitigate the risks associated with backdoor attacks by identifying and removing suspected poisoned examples. However, we observe that these strategies fail to offer effective protection against several advanced backdoor attacks. To remedy this deficiency, we propose a novel defensive mechanism that first exploits training dynamics to identify poisoned samples with high precision, followed by a label propagation step to improve recall and thus remove the majority of poisoned instances. Compared with recent advanced defense methods, our method considerably reduces the success rates of several backdoor attacks while maintaining high classification accuracy on clean test sets.

Submitted 19 May, 2024; originally announced May 2024.

Comments: accepted to TACL

arXiv:2405.08892 [pdf, other]

RS-Reg: Probabilistic and Robust Certified Regression Through Randomized Smoothing

Authors: Aref Miri Rekavandi, Olga Ohrimenko, Benjamin I. P. Rubinstein

Abstract: Randomized smoothing has shown promising certified robustness against adversaries in classification tasks. Despite such success with only zeroth-order access to base models, randomized smoothing has not been extended to a general form of regression. By defining robustness in regression tasks flexibly through probabilities, we demonstrate how to establish upper bounds on input data point perturbation (using the $\ell_2$ norm) for a user-specified probability of observing valid outputs. Furthermore, we showcase the asymptotic property of a basic averaging function in scenarios where the regression model operates without any constraint. We then derive a certified upper bound of the input perturbations when dealing with a family of regression models where the outputs are bounded. Our simulations verify the validity of the theoretical results and reveal the advantages and limitations of simple smoothing functions, i.e., averaging, in regression tasks. The code is publicly available at \url{https://github.com/arekavandi/Certified_Robust_Regression}.

Submitted 14 May, 2024; originally announced May 2024.

arXiv:2404.19597 [pdf, other]

Transferring Troubles: Cross-Lingual Transferability of Backdoor Attacks in LLMs with Instruction Tuning

Authors: Xuanli He, Jun Wang, Qiongkai Xu, Pasquale Minervini, Pontus Stenetorp, Benjamin I. P. Rubinstein, Trevor Cohn

Abstract: The implications of backdoor attacks on English-centric large language models (LLMs) have been widely examined - such attacks can be achieved by embedding malicious behaviors during training and activated under specific conditions that trigger malicious outputs. However, the impact of backdoor attacks on multilingual models remains under-explored. Our research focuses on cross-lingual backdoor attacks against multilingual LLMs, particularly investigating how poisoning the instruction-tuning data in one or two languages can affect the outputs in languages whose instruction-tuning data was not poisoned. Despite its simplicity, our empirical analysis reveals that our method exhibits remarkable efficacy in models like mT5, BLOOM, and GPT-3.5-turbo, with high attack success rates, surpassing 95% in several languages across various scenarios. Alarmingly, our findings also indicate that larger models show increased susceptibility to transferable cross-lingual backdoor attacks, which also applies to LLMs predominantly pre-trained on English data, such as Llama2, Llama3, and Gemma. Moreover, our experiments show that triggers can still work even after paraphrasing, and the backdoor mechanism proves highly effective in cross-lingual response settings across 25 languages, achieving an average attack success rate of 50%. Our study aims to highlight the vulnerabilities and significant security risks present in current multilingual LLMs, underscoring the emergent need for targeted security measures.

Submitted 30 April, 2024; originally announced April 2024.

Comments: work in progress

arXiv:2404.02393 [pdf, other]

Backdoor Attack on Multilingual Machine Translation

Authors: Jun Wang, Qiongkai Xu, Xuanli He, Benjamin I. P. Rubinstein, Trevor Cohn

Abstract: While multilingual machine translation (MNMT) systems hold substantial promise, they also have security vulnerabilities. Our research highlights that MNMT systems can be susceptible to a particularly devious style of backdoor attack, whereby an attacker injects poisoned data into a low-resource language pair to cause malicious translations in other languages, including high-resource languages. Our experimental results reveal that injecting less than 0.01% poisoned data into a low-resource language pair can achieve an average 20% attack success rate in attacking high-resource language pairs. This type of attack is of particular concern, given the larger attack surface of languages inherent to low-resource settings. Our aim is to bring attention to these vulnerabilities within MNMT systems with the hope of encouraging the community to address security concerns in machine translation, especially in the context of low-resource languages.

Submitted 2 April, 2024; originally announced April 2024.

Comments: NAACL main long paper

arXiv:2401.17628 [pdf, other]

Elephants Do Not Forget: Differential Privacy with State Continuity for Privacy Budget

Authors: Jiankai Jin, Chitchanok Chuengsatiansup, Toby Murray, Benjamin I. P. Rubinstein, Yuval Yarom, Olga Ohrimenko

Abstract: Current implementations of differentially-private (DP) systems either lack support to track the global privacy budget consumed on a dataset, or fail to faithfully maintain the state continuity of this budget. We show that failure to maintain a privacy budget enables an adversary to mount replay, rollback and fork attacks - obtaining answers to many more queries than what a secure system would allow. As a result the attacker can reconstruct secret data that DP aims to protect - even if DP code runs in a Trusted Execution Environment (TEE). We propose ElephantDP, a system that aims to provide the same guarantees as a trusted curator in the global DP model would, albeit set in an untrusted environment. Our system relies on a state continuity module to provide protection for the privacy budget and a TEE to faithfully execute DP code and update the budget. To provide security, our protocol makes several design choices including the content of the persistent state and the order between budget updates and query answers. We prove that ElephantDP provides liveness (i.e., the protocol can restart from a correct state and respond to queries as long as the budget is not exceeded) and DP confidentiality (i.e., an attacker learns about a dataset as much as it would from interacting with a trusted curator). Our implementation and evaluation of the protocol use Intel SGX as a TEE to run the DP code and a network of TEEs to maintain state continuity. Compared to an insecure baseline, we observe only 1.1-2$\times$ overheads and lower relative overheads for larger datasets and complex DP queries.

Submitted 31 January, 2024; originally announced January 2024.

arXiv:2309.11005 [pdf, other]

It's Simplex! Disaggregating Measures to Improve Certified Robustness

Authors: Andrew C. Cullen, Paul Montague, Shijie Liu, Sarah M. Erfani, Benjamin I. P. Rubinstein

Abstract: Certified robustness circumvents the fragility of defences against adversarial attacks, by endowing model predictions with guarantees of class invariance for attacks up to a calculated size. While there is value in these certifications, the techniques through which we assess their performance do not present a proper accounting of their strengths and weaknesses, as their analysis has eschewed consideration of performance over individual samples in favour of aggregated measures. By considering the potential output space of certified models, this work presents two distinct approaches to improve the analysis of certification mechanisms, that allow for both dataset-independent and dataset-dependent measures of certification performance. Embracing such a perspective uncovers new certification approaches, which have the potential to more than double the achievable radius of certification, relative to current state-of-the-art. Empirical evaluation verifies that our new approach can certify $9\%$ more samples at noise scale $\sigma = 1$, with greater relative improvements observed as the difficulty of the predictive task increases.

Submitted 19 September, 2023; originally announced September 2023.

Comments: IEEE S&P 2024, IEEE Security & Privacy 2024, 14 pages

arXiv:2308.07553 [pdf, other]

doi 10.1609/aaai.v37i7.26065

Enhancing the Antidote: Improved Pointwise Certifications against Poisoning Attacks

Authors: Shijie Liu, Andrew C. Cullen, Paul Montague, Sarah M. Erfani, Benjamin I. P. Rubinstein

Abstract: Poisoning attacks can disproportionately influence model behaviour by making small changes to the training corpus. While defences against specific poisoning attacks do exist, they in general do not provide any guarantees, leaving them potentially countered by novel attacks. In contrast, by examining worst-case behaviours Certified Defences make it possible to provide guarantees of the robustness of a sample against adversarial attacks modifying a finite number of training samples, known as pointwise certification. We achieve this by exploiting both Differential Privacy and the Sampled Gaussian Mechanism to ensure the invariance of prediction for each testing instance against finite numbers of poisoned examples. In doing so, our model provides guarantees of adversarial robustness that are more than twice as large as those provided by prior certifications.

Submitted 18 March, 2024; v1 submitted 14 August, 2023; originally announced August 2023.

Journal ref: Proceedings of the 2023 AAAI Conference on Artificial Intelligence, 37(7), 8861-8869

arXiv:2305.07156 [pdf, ps, other]

Improved Upper and Lower Bounds on the Capacity of the Binary Deletion Channel

Authors: Ittai Rubinstein, Roni Con

Abstract: The {\em binary deletion channel} with deletion probability $d$ ($\text{BDC}_d$) is a random channel that deletes each bit of the input message i.i.d with probability $d$. It has been studied extensively as a canonical example of a channel with synchronization errors. Perhaps the most important question regarding the BDC is determining its capacity. Mitzenmacher and Drinea (ITIT 2006) and Kirsch and Drinea (ITIT 2009) show a method by which distributions on run lengths can be converted to codes for the BDC, yielding a lower bound of $\mathcal{C}(\text{BDC}_d) > 0.1185 \cdot (1-d)$. Fertonani and Duman (ITIT 2010), Dalai (ISIT 2011) and Rahmati and Duman (ITIT 2014) use computer aided analyses based on the Blahut-Arimoto algorithm to prove an upper bound of $\mathcal{C}(\text{BDC}_d) < 0.4143\cdot(1-d)$ in the high deletion probability regime ($d > 0.65$). In this paper, we show that the Blahut-Arimoto algorithm can be implemented with a lower space complexity, allowing us to extend the upper bound analyses, and prove an upper bound of $\mathcal{C}(\text{BDC}_d) < 0.3745 \cdot(1-d)$ for all $d \geq 0.68$. Furthermore, we show that an extension of the Blahut-Arimoto algorithm can also be used to select better run length distributions for Mitzenmacher and Drinea's construction, yielding a lower bound of $\mathcal{C}(\text{BDC}_d) > 0.1221 \cdot (1 - d)$.

Submitted 11 May, 2023; originally announced May 2023.

MSC Class: 94B65 ACM Class: E.4

arXiv:2302.04379 [pdf, other]

Et Tu Certifications: Robustness Certificates Yield Better Adversarial Examples

Authors: Andrew C. Cullen, Shijie Liu, Paul Montague, Sarah M. Erfani, Benjamin I. P. Rubinstein

Abstract: In guaranteeing the absence of adversarial examples in an instance's neighbourhood, certification mechanisms play an important role in demonstrating neural net robustness. In this paper, we ask if these certifications can compromise the very models they help to protect? Our new \emph{Certification Aware Attack} exploits certifications to produce computationally efficient norm-minimising adversarial examples $74 \%$ more often than comparable attacks, while reducing the median perturbation norm by more than $10\%$. While these attacks can be used to assess the tightness of certification bounds, they also highlight that releasing certifications can paradoxically reduce security.

Submitted 11 June, 2024; v1 submitted 8 February, 2023; originally announced February 2023.

Comments: 17 pages, 8 figures

ACM Class: I.2.6; I.4.9

arXiv:2302.01757 [pdf, other]

RS-Del: Edit Distance Robustness Certificates for Sequence Classifiers via Randomized Deletion

Authors: Zhuoqun Huang, Neil G. Marchant, Keane Lucas, Lujo Bauer, Olga Ohrimenko, Benjamin I. P. Rubinstein

Abstract: Randomized smoothing is a leading approach for constructing classifiers that are certifiably robust against adversarial examples. Existing work on randomized smoothing has focused on classifiers with continuous inputs, such as images, where $\ell_p$-norm bounded adversaries are commonly studied. However, there has been limited work for classifiers with discrete or variable-size inputs, such as for source code, which require different threat models and smoothing mechanisms. In this work, we adapt randomized smoothing for discrete sequence classifiers to provide certified robustness against edit distance-bounded adversaries. Our proposed smoothing mechanism randomized deletion (RS-Del) applies random deletion edits, which are (perhaps surprisingly) sufficient to confer robustness against adversarial deletion, insertion and substitution edits. Our proof of certification deviates from the established Neyman-Pearson approach, which is intractable in our setting, and is instead organized around longest common subsequences. We present a case study on malware detection--a binary classification problem on byte sequences where classifier evasion is a well-established threat model. When applied to the popular MalConv malware detection model, our smoothing mechanism RS-Del achieves a certified accuracy of 91% at an edit distance radius of 128 bytes.

Submitted 24 January, 2024; v1 submitted 30 January, 2023; originally announced February 2023.

Comments: Final camera-ready version for NeurIPS 2023. 36 pages, 7 figures, 12 tables. Includes 20 pages of appendices. Code available at https://github.com/Dovermore/randomized-deletion

arXiv:2301.02962 [pdf, other]

doi 10.1093/jssam/smac030

Bayesian Graphical Entity Resolution Using Exchangeable Random Partition Priors

Authors: Neil G. Marchant, Benjamin I. P. Rubinstein, Rebecca C. Steorts

Abstract: Entity resolution (record linkage or deduplication) is the process of identifying and linking duplicate records in databases. In this paper, we propose a Bayesian graphical approach for entity resolution that links records to latent entities, where the prior representation on the linkage structure is exchangeable. First, we adopt a flexible and tractable set of priors for the linkage structure, which corresponds to a special class of random partition models. Second, we propose a more realistic distortion model for categorical/discrete record attributes, which corrects a logical inconsistency with the standard hit-miss model. Third, we incorporate hyperpriors to improve flexibility. Fourth, we employ a partially collapsed Gibbs sampler for inferential speedups. Using a selection of private and nonprivate data sets, we investigate the impact of our modeling contributions and compare our model with two alternative Bayesian models. In addition, we conduct a simulation study for household survey data, where we vary distortion, duplication rates and data set size. We find that our model performs more consistently than the alternatives across a variety of scenarios and typically achieves the highest entity resolution accuracy (F1 score). Open source software is available for our proposed methodology, and we provide a discussion regarding our work and future directions.

Submitted 7 January, 2023; originally announced January 2023.

Comments: 27 pages, 4 figures, 3 tables. Includes 37 pages of appendices. This is an accepted manuscript to be published in the Journal of Survey Statistics and Methodology

arXiv:2210.06077 [pdf, other]

Double Bubble, Toil and Trouble: Enhancing Certified Robustness through Transitivity

Authors: Andrew C. Cullen, Paul Montague, Shijie Liu, Sarah M. Erfani, Benjamin I. P. Rubinstein

Abstract: In response to subtle adversarial examples flipping classifications of neural network models, recent research has promoted certified robustness as a solution. There, invariance of predictions to all norm-bounded attacks is achieved through randomised smoothing of network inputs. Today's state-of-the-art certifications make optimal use of the class output scores at the input instance under test: no better radius of certification (under the $L_2$ norm) is possible given only these scores. However, it is an open question as to whether such lower bounds can be improved using local information around the instance under test. In this work, we demonstrate how today's "optimal" certificates can be improved by exploiting both the transitivity of certifications, and the geometry of the input space, giving rise to what we term Geometrically-Informed Certified Robustness. By considering the smallest distance to points on the boundary of a set of certifications this approach improves certifications for more than $80\%$ of Tiny-Imagenet instances, yielding an on average $5 \%$ increase in the associated certification. When incorporating training time processes that enhance the certified radius, our technique shows even more promising results, with a uniform $4$ percentage point increase in the achieved certified radius.

Submitted 12 October, 2022; originally announced October 2022.

Comments: Accepted for NeurIPS'22, 19 pages, 14 figures, for associated code see https://github.com/andrew-cullen/DoubleBubble

ACM Class: I.2.6; I.4.9

arXiv:2210.05455 [pdf, ps, other]

Unlabelled Sample Compression Schemes for Intersection-Closed Classes and Extremal Classes

Authors: J. Hyam Rubinstein, Benjamin I. P. Rubinstein

Abstract: The sample compressibility of concept classes plays an important role in learning theory, as a sufficient condition for PAC learnability, and more recently as an avenue for robust generalisation in adaptive data analysis. Whether compression schemes of size $O(d)$ must necessarily exist for all classes of VC dimension $d$ is unknown, but conjectured to be true by Warmuth. Recently Chalopin, Chepoi, Moran, and Warmuth (2018) gave a beautiful unlabelled sample compression scheme of size VC dimension for all maximum classes: classes that meet the Sauer-Shelah-Perles Lemma with equality. They also offered a counterexample to compression schemes based on a promising approach known as corner peeling. In this paper we simplify and extend their proof technique to deal with so-called extremal classes of VC dimension $d$ which contain maximum classes of VC dimension $d-1$. A criterion is given which would imply that all extremal classes admit unlabelled compression schemes of size $d$. We also prove that all intersection-closed classes with VC dimension $d$ admit unlabelled compression schemes of size at most $11d$.

Submitted 11 October, 2022; originally announced October 2022.

Comments: Appearing at NeurIPS2022

arXiv:2207.11575 [pdf, other]

Testing the Robustness of Learned Index Structures

Authors: Matthias Bachfischer, Renata Borovica-Gajic, Benjamin I. P. Rubinstein

Abstract: While early empirical evidence has supported the case for learned index structures as having favourable average-case performance, little is known about their worst-case performance. By contrast, classical structures are known to achieve optimal worst-case behaviour. This work evaluates the robustness of learned index structures in the presence of adversarial workloads. To simulate adversarial workloads, we carry out a data poisoning attack on linear regression models that manipulates the cumulative distribution function (CDF) on which the learned index model is trained. The attack deteriorates the fit of the underlying ML model by injecting a set of poisoning keys into the training dataset, which leads to an increase in the prediction error of the model and thus deteriorates the overall performance of the learned index structure. We assess the performance of various regression methods and the learned index implementations ALEX and PGM-Index. We show that learned index structures can suffer from a significant performance deterioration of up to 20% when evaluated on poisoned vs. non-poisoned datasets.

Submitted 23 July, 2022; originally announced July 2022.

arXiv:2207.11489 [pdf, ps, other]

Average-Case to (shifted) Worst-Case Reduction for the Trace Reconstruction Problem

Authors: Ittai Rubinstein

Abstract: The {\em insertion-deletion channel} takes as input a binary string $x \in\{0, 1\}^n$, and outputs a string $\widetilde{x}$ where some of the bits have been deleted and others inserted independently at random. In the {\em trace reconstruction problem}, one is given many outputs (called {\em traces}) of the insertion-deletion channel on the same input message $x$, and asked to recover the input message. Nazarov and Peres (STOC 2017), and De, O'Donnell and Servedio (STOC 2017) showed that any string $x$ can be reconstructed from $\exp(O(n^{1/3}))$ traces. Holden, Pemantle, Peres and Zhai (COLT 2018) adapt the techniques used to prove this upper bound, to an algorithm for the average-case trace reconstruction with a sample complexity of $\exp(O(\log^{1/3} n))$. However, it is not clear how to apply their techniques more generally and in particular for the recent worst-case upper bound of $\exp(\widetilde{O}(n^{1/5}))$ shown by Chase (STOC 2021) for the deletion-channel. We prove a general reduction from the average-case to smaller instances of a problem similar to worst-case. Using this reduction and a generalization of Chase's bound, we construct an improved average-case algorithm with a sample complexity of $\exp(\widetilde{O}(\log^{1/5} n))$. Additionally, we show that Chase's upper-bound holds for the insertion-deletion channel as well.

Submitted 12 August, 2022; v1 submitted 23 July, 2022; originally announced July 2022.

arXiv:2205.10159 [pdf, other]

Getting a-Round Guarantees: Floating-Point Attacks on Certified Robustness

Authors: Jiankai Jin, Olga Ohrimenko, Benjamin I. P. Rubinstein

Abstract: Adversarial examples pose a security risk as they can alter decisions of a machine learning classifier through slight input perturbations. Certified robustness has been proposed as a mitigation where given an input $\mathbf{x}$, a classifier returns a prediction and a certified radius $R$ with a provable guarantee that any perturbation to $\mathbf{x}$ with $R$-bounded norm will not alter the classifier's prediction. In this work, we show that these guarantees can be invalidated due to limitations of floating-point representation that cause rounding errors. We design a rounding search method that can efficiently exploit this vulnerability to find adversarial examples against state-of-the-art certifications in two threat models, that differ in how the norm of the perturbation is computed. We show that the attack can be carried out against linear classifiers that have exact certifiable guarantees and against neural networks that have conservative certifications. In the weak threat model, our experiments demonstrate attack success rates over 50% on random linear classifiers, up to 23% on the MNIST dataset for linear SVM, and up to 15% for a neural network. In the strong threat model, the success rates are lower but positive. The floating-point errors exploited by our attacks can range from small to large (e.g., $10^{-13}$ to $10^{3}$) - showing that even negligible errors can be systematically exploited to invalidate guarantees provided by certified robustness. Finally, we propose a formal mitigation approach based on bounded interval arithmetic, encouraging future implementations of robustness certificates to account for limitations of modern computing architecture to provide sound certifiable guarantees.

Submitted 4 October, 2023; v1 submitted 20 May, 2022; originally announced May 2022.

arXiv:2112.15498 [pdf, other]

State Selection Algorithms and Their Impact on The Performance of Stateful Network Protocol Fuzzing

Authors: Dongge Liu, Van-Thuan Pham, Gidon Ernst, Toby Murray, Benjamin I. P. Rubinstein

Abstract: The statefulness property of network protocol implementations poses a unique challenge for testing and verification techniques, including Fuzzing. Stateful fuzzers tackle this challenge by leveraging state models to partition the state space and assist the test generation process. Since not all states are equally important and fuzzing campaigns have time limits, fuzzers need effective state selection algorithms to prioritize progressive states over others. Several state selection algorithms have been proposed but they were implemented and evaluated separately on different platforms, making it hard to achieve conclusive findings. In this work, we evaluate an extensive set of state selection algorithms on the same fuzzing platform that is AFLNet, a state-of-the-art fuzzer for network servers. The algorithm set includes existing ones supported by AFLNet and our novel and principled algorithm called AFLNetLegion. The experimental results on the ProFuzzBench benchmark show that (i) the existing state selection algorithms of AFLNet achieve very similar code coverage, (ii) AFLNetLegion clearly outperforms these algorithms in selected case studies, but (iii) the overall improvement appears insignificant. These are unexpected yet interesting findings. We identify problems and share insights that could open opportunities for future research on this topic.

Submitted 7 January, 2022; v1 submitted 24 December, 2021; originally announced December 2021.

Comments: 10 pages, 8 figures, coloured, conference

arXiv:2112.05307 [pdf, other]

Are We There Yet? Timing and Floating-Point Attacks on Differential Privacy Systems

Authors: Jiankai Jin, Eleanor McMurtry, Benjamin I. P. Rubinstein, Olga Ohrimenko

Abstract: Differential privacy is a de facto privacy framework that has seen adoption in practice via a number of mature software platforms. Implementation of differentially private (DP) mechanisms has to be done carefully to ensure end-to-end security guarantees. In this paper we study two implementation flaws in the noise generation commonly used in DP systems. First we examine the Gaussian mechanism's susceptibility to a floating-point representation attack. The premise of this first vulnerability is similar to the one carried out by Mironov in 2011 against the Laplace mechanism. Our experiments show attack's success against DP algorithms, including deep learning models trained using differentially-private stochastic gradient descent. In the second part of the paper we study discrete counterparts of the Laplace and Gaussian mechanisms that were previously proposed to alleviate the shortcomings of floating-point representation of real numbers. We show that such implementations unfortunately suffer from another side channel: a novel timing attack. An observer that can measure the time to draw (discrete) Laplace or Gaussian noise can predict the noise magnitude, which can then be used to recover sensitive attributes. This attack invalidates differential privacy guarantees of systems implementing such mechanisms. We demonstrate that several commonly used, state-of-the-art implementations of differential privacy are susceptible to these attacks. We report success rates up to 92.56% for floating point attacks on DP-SGD, and up to 99.65% for end-to-end timing attacks on private sum protected with discrete Laplace. Finally, we evaluate and suggest partial mitigations.

Submitted 15 June, 2022; v1 submitted 9 December, 2021; originally announced December 2021.

Comments: In Proceedings of the 43rd IEEE Symposium on Security and Privacy (IEEE S&P 2022)

Journal ref: https://www.computer.org/csdl/proceedings-article/sp/2022/131600b547/1CIO7Ty2xr2

arXiv:2111.00261 [pdf, other]

Explicit and Efficient Construction of (nearly) Optimal Rate Codes for Binary Deletion Channel and the Poisson Repeat Channel

Authors: Ittai Rubinstein

Abstract: Two of the most common models for channels with synchronisation errors are the Binary Deletion Channel with parameter $p$ ($\text{BDC}_p$) -- a channel where every bit of the codeword is deleted i.i.d with probability $p$, and the Poisson Repeat Channel with parameter $λ$ ($\text{PRC}_λ$) -- a channel where every bit of the codeword is repeated $\text{Poisson}(λ)$ times. Previous constructions based on synchronisation strings yielded codes with rates far lower than the capacities of these channels [CS19, GL18], and the only efficient construction to achieve capacity on the BDC at the time of writing this paper is based on the far more advanced methods of polar codes [TPFV21]. In this work, we present a new method for concatenating synchronisation codes and use it to construct simple and efficient encoding and decoding algorithms for both channels with nearly optimal rates.

Submitted 17 June, 2022; v1 submitted 30 October, 2021; originally announced November 2021.

arXiv:2109.14208 [pdf, other]

A Communication Security Game on Switched Systems for Autonomous Vehicle Platoons

Authors: Guoxin Sun, Tansu Alpcan, Benjamin I. P. Rubinstein, Seyit Camtepe

Abstract: Vehicle-to-vehicle communication enables autonomous platoons to boost traffic efficiency and safety, while ensuring string stability with a constant spacing policy. However, communication-based controllers are susceptible to a range of cyber-attacks. In this paper, we propose a distributed attack mitigation defense framework with a dual-mode control system reconfiguration scheme to prevent a compromised platoon member from causing collisions via message falsification attacks. In particular, we model it as a switched system consisting of a communication-based cooperative controller and a sensor-based local controller and derive conditions to achieve global uniform exponential stability (GUES) as well as string stability in the sense of platoon operation. The switching decision comes from game-theoretic analysis of the attacker and the defender's interactions. In this framework, the attacker acts as a leader that chooses whether to engage in malicious activities and the defender decides which control system to deploy with the help of an anomaly detector. Imperfect detection reports associate the game with imperfect information. A dedicated state constraint further enhances safety against bounded but aggressive message modifications in which a bounded solution may still violate practical constraint e.g. vehicles nearly crashing. Our formulation uniquely combines switched systems with security games to strategically improve the safety of such autonomous vehicle systems.

Submitted 29 September, 2021; originally announced September 2021.

Comments: 9 pages, 5 figures; full version of paper accepted to CDC2021

arXiv:2109.11803 [pdf, other]

Local Intrinsic Dimensionality Signals Adversarial Perturbations

Authors: Sandamal Weerasinghe, Tansu Alpcan, Sarah M. Erfani, Christopher Leckie, Benjamin I. P. Rubinstein

Abstract: The vulnerability of machine learning models to adversarial perturbations has motivated a significant amount of research under the broad umbrella of adversarial machine learning. Sophisticated attacks may cause learning algorithms to learn decision functions or make decisions with poor predictive performance. In this context, there is a growing body of literature that uses local intrinsic dimensionality (LID), a local metric that describes the minimum number of latent variables required to describe each data point, for detecting adversarial samples and subsequently mitigating their effects. The research to date has tended to focus on using LID as a practical defence method often without fully explaining why LID can detect adversarial samples. In this paper, we derive a lower-bound and an upper-bound for the LID value of a perturbed data point and demonstrate that the bounds, in particular the lower-bound, has a positive correlation with the magnitude of the perturbation. Hence, we demonstrate that data points that are perturbed by a large amount would have large LID values compared to unperturbed samples, thus justifying its use in the prior literature. Furthermore, our empirical validation demonstrates the validity of the bounds on benchmark datasets.

Submitted 24 September, 2021; originally announced September 2021.

Comments: 13 pages

arXiv:2109.08266 [pdf, other]

Hard to Forget: Poisoning Attacks on Certified Machine Unlearning

Authors: Neil G. Marchant, Benjamin I. P. Rubinstein, Scott Alfeld

Abstract: The right to erasure requires removal of a user's information from data held by organizations, with rigorous interpretations extending to downstream products such as learned models. Retraining from scratch with the particular user's data omitted fully removes its influence on the resulting model, but comes with a high computational cost. Machine "unlearning" mitigates the cost incurred by full retraining: instead, models are updated incrementally, possibly only requiring retraining when approximation errors accumulate. Rapid progress has been made towards privacy guarantees on the indistinguishability of unlearned and retrained models, but current formalisms do not place practical bounds on computation. In this paper we demonstrate how an attacker can exploit this oversight, highlighting a novel attack surface introduced by machine unlearning. We consider an attacker aiming to increase the computational cost of data removal. We derive and empirically investigate a poisoning attack on certified machine unlearning where strategically designed training data triggers complete retraining when removed.

Submitted 9 February, 2022; v1 submitted 16 September, 2021; originally announced September 2021.

Comments: Align with camera-ready submission to AAAI-22. Changes include: switched to row-wise normalization in Algorithm 3, added link to GitHub repository, added Appendix C with additional results on long-term effectiveness

arXiv:2108.10130 [pdf, other]

No DBA? No regret! Multi-armed bandits for index tuning of analytical and HTAP workloads with provable guarantees

Authors: R. Malinga Perera, Bastian Oetomo, Benjamin I. P. Rubinstein, Renata Borovica-Gajic

Abstract: Automating physical database design has remained a long-term interest in database research due to substantial performance gains afforded by optimised structures. Despite significant progress, a majority of today's commercial solutions are highly manual, requiring offline invocation by database administrators (DBAs) who are expected to identify and supply representative training workloads. Even the latest advancements like query stores provide only limited support for dynamic environments. This status quo is untenable: identifying representative static workloads is no longer realistic; and physical design tools remain susceptible to the query optimiser's cost misestimates. Furthermore, modern application environments such as hybrid transactional and analytical processing (HTAP) systems render analytical modelling next to impossible. We propose a self-driving approach to online index selection that eschews the DBA and query optimiser, and instead learns the benefits of viable structures through strategic exploration and direct performance observation. We view the problem as one of sequential decision making under uncertainty, specifically within the bandit learning setting. Multi-armed bandits balance exploration and exploitation to provably guarantee average performance that converges to policies that are optimal with perfect hindsight. Our comprehensive empirical evaluation against a state-of-the-art commercial tuning tool demonstrates up to 75% speed-up on shifting and ad-hoc workloads and up to 28% speed-up on static workloads in analytical processing environments. In HTAP environments, our solution provides up to 59% speed-up on shifting and 51% speed-up on static workloads. Furthermore, our bandit framework outperforms deep reinforcement learning (RL) in terms of convergence speed and performance volatility (providing up to 58% speed-up).

Submitted 23 August, 2021; originally announced August 2021.

Comments: 25 pages, 20 figures, 5 tables. arXiv admin note: substantial text overlap with arXiv:2010.09208

arXiv:2107.08357 [pdf, other]

As Easy as 1, 2, 3: Behavioural Testing of NMT Systems for Numerical Translation

Authors: Jun Wang, Chang Xu, Francisco Guzman, Ahmed El-Kishky, Benjamin I. P. Rubinstein, Trevor Cohn

Abstract: Mistranslated numbers have the potential to cause serious effects, such as financial loss or medical misinformation. In this work we develop comprehensive assessments of the robustness of neural machine translation systems to numerical text via behavioural testing. We explore a variety of numerical translation capabilities a system is expected to exhibit and design effective test examples to expose system underperformance. We find that numerical mistranslation is a general issue: major commercial systems and state-of-the-art research models fail on many of our test examples, for high- and low-resource languages. Our tests reveal novel errors that have not previously been reported in NMT systems, to the best of our knowledge. Lastly, we discuss strategies to mitigate numerical mistranslation.

Submitted 18 July, 2021; originally announced July 2021.

Comments: Findings of ACL, to appear

arXiv:2107.05243 [pdf, other]

Putting words into the system's mouth: A targeted attack on neural machine translation using monolingual data poisoning

Authors: Jun Wang, Chang Xu, Francisco Guzman, Ahmed El-Kishky, Yuqing Tang, Benjamin I. P. Rubinstein, Trevor Cohn

Abstract: Neural machine translation systems are known to be vulnerable to adversarial test inputs, however, as we show in this paper, these systems are also vulnerable to training attacks. Specifically, we propose a poisoning attack in which a malicious adversary inserts a small poisoned sample of monolingual text into the training set of a system trained using back-translation. This sample is designed to induce a specific, targeted translation behaviour, such as peddling misinformation. We present two methods for crafting poisoned examples, and show that only a tiny handful of instances, amounting to only 0.02% of the training set, is sufficient to enact a successful attack. We outline a defence method against said attacks, which partly ameliorates the problem. However, we stress that this is a blind-spot in modern NMT, demanding immediate attention.

Submitted 12 July, 2021; originally announced July 2021.

Comments: Findings of ACL, to appear

arXiv:2106.12057 [pdf, other]

doi 10.1088/1742-5468/ac57b8

Multivariate Generating Functions for Information Spread on Multi-Type Random Graphs

Authors: Yaron Oz, Ittai Rubinstein, Muli Safra

Abstract: We study the spread of information on multi-type directed random graphs. In such graphs the vertices are partitioned into distinct types (communities) that have different transmission rates between themselves and with other types. We construct multivariate generating functions and use multi-type branching processes to derive an equation for the size of the large out-components in multi-type random graphs with a general class of degree distributions. We use our methods to analyse the spread of epidemics and verify the results with population based simulations.

Submitted 26 February, 2022; v1 submitted 22 June, 2021; originally announced June 2021.

Comments: 27 pages, 4 figures

arXiv:2011.02142 [pdf]

Not fit for Purpose: A critical analysis of the 'Five Safes'

Authors: Chris Culnane, Benjamin I. P. Rubinstein, David Watts

Abstract: Adopted by government agencies in Australia, New Zealand and the UK as policy instrument or as embodied into legislation, the 'Five Safes' framework aims to manage risks of releasing data derived from personal information. Despite its popularity, the Five Safes has undergone little legal or technical critical analysis. We argue that the Five Safes is fundamentally flawed: from being disconnected from existing legal protections and appropriation of notions of safety without providing any means to prefer strong technical measures, to viewing disclosure risk as static through time and not requiring repeat assessment. The Five Safes provides little confidence that resulting data sharing is performed using 'safety' best practice or for purposes in service of public interest.

Submitted 4 November, 2020; originally announced November 2020.

arXiv:2011.00675 [pdf, other]

doi 10.1145/3442381.3450034

A Targeted Attack on Black-Box Neural Machine Translation with Parallel Data Poisoning

Authors: Chang Xu, Jun Wang, Yuqing Tang, Francisco Guzman, Benjamin I. P. Rubinstein, Trevor Cohn

Abstract: As modern neural machine translation (NMT) systems have been widely deployed, their security vulnerabilities require close scrutiny. Most recently, NMT systems have been found vulnerable to targeted attacks which cause them to produce specific, unsolicited, and even harmful translations. These attacks are usually exploited in a white-box setting, where adversarial inputs causing targeted translations are discovered for a known target system. However, this approach is less viable when the target system is black-box and unknown to the adversary (e.g., secured commercial systems). In this paper, we show that targeted attacks on black-box NMT systems are feasible, based on poisoning a small fraction of their parallel training data. We show that this attack can be realised practically via targeted corruption of web documents crawled to form the system's training data. We then analyse the effectiveness of the targeted poisoning in two common NMT training scenarios: the from-scratch training and the pre-train & fine-tune paradigm. Our results are alarming: even on the state-of-the-art systems trained with massive parallel data (tens of millions), the attacks are still successful (over 50% success rate) under surprisingly low poisoning budgets (e.g., 0.006%). Lastly, we discuss potential defences to counter such attacks.

Submitted 15 February, 2021; v1 submitted 1 November, 2020; originally announced November 2020.

Comments: In Proceedings of the 2021 World Wide Web Conference (WWW 2021)

arXiv:2010.09208 [pdf, other]

DBA bandits: Self-driving index tuning under ad-hoc, analytical workloads with safety guarantees

Authors: R. Malinga Perera, Bastian Oetomo, Benjamin I. P. Rubinstein, Renata Borovica-Gajic

Abstract: Automating physical database design has remained a long-term interest in database research due to substantial performance gains afforded by optimised structures. Despite significant progress, a majority of today's commercial solutions are highly manual, requiring offline invocation by database administrators (DBAs) who are expected to identify and supply representative training workloads. Unfortunately, the latest advancements like query stores provide only limited support for dynamic environments. This status quo is untenable: identifying representative static workloads is no longer realistic; and physical design tools remain susceptible to the query optimiser's cost misestimates (stemming from unrealistic assumptions such as attribute value independence and uniformity of data distribution). We propose a self-driving approach to online index selection that eschews the DBA and query optimiser, and instead learns the benefits of viable structures through strategic exploration and direct performance observation. We view the problem as one of sequential decision making under uncertainty, specifically within the bandit learning setting. Multi-armed bandits balance exploration and exploitation to provably guarantee average performance that converges to a fixed policy that is optimal with perfect hindsight. Our comprehensive empirical results demonstrate up to 75% speed-up on shifting and ad-hoc workloads and 28% speed-up on static workloads compared against a state-of-the-art commercial tuning tool.

Submitted 19 October, 2020; v1 submitted 19 October, 2020; originally announced October 2020.

Comments: 12 pages, 8 figures

arXiv:2009.01923 [pdf, other]

Heterogeneity and Superspreading Effect on Herd Immunity

Authors: Yaron Oz, Ittai Rubinstein, Muli Safra

Abstract: We model and calculate the fraction of infected population necessary to reach herd immunity, taking into account the heterogeneity in infectiousness and susceptibility, as well as the correlation between those two parameters. We show that these cause the effective reproduction number to decrease more rapidly, and consequently have a drastic effect on the estimate of the necessary percentage of the population that has to contract the disease for herd immunity to be reached. We quantify the difference between the size of the infected population when the effective reproduction number decreases below 1 vs. the ultimate fraction of population that had contracted the disease. This sheds light on an important distinction between herd immunity and the end of the disease and highlights the importance of limiting the spread of the disease even if we plan to naturally reach herd immunity. We analyze the effect of various lock-down scenarios on the resulting final fraction of infected population. We discuss implications to COVID-19 and other pandemics and compare our theoretical results to population-based simulations. We consider the dependence of the disease spread on the architecture of the infectiousness graph and analyze different graph architectures and the limitations of the graph models.

Submitted 15 January, 2021; v1 submitted 1 September, 2020; originally announced September 2020.

Comments: 16 pages, 5 figures, includes population based simulations

arXiv:2008.07352 [pdf, other]

doi 10.1088/1742-5468/abed44

Superspreaders and High Variance Infectious Diseases

Authors: Yaron Oz, Ittai Rubinstein, Muli Safra

Abstract: A well-known characteristic of pandemics such as COVID-19 is the high level of transmission heterogeneity in the infection spread: not all infected individuals spread the disease at the same rate and some individuals (superspreaders) are responsible for most of the infections. To quantify this phenomenon requires the analysis of the effect of the variance and higher moments of the infection distribution. Working in the framework of stochastic branching processes, we derive an approximate analytical formula for the probability of an outbreak in the high variance regime of the infection distribution, verify it numerically and analyze its regime of validity in various examples. We show that it is possible for an outbreak not to occur in the high variance regime even when the basic reproduction number $R_0$ is larger than one and discuss the implications of our results for COVID-19 and other pandemics.

Submitted 12 October, 2020; v1 submitted 17 August, 2020; originally announced August 2020.

Comments: 9 pages, 5 figures

arXiv:2008.00703 [pdf, other]

doi 10.1103/PhysRevFluids.5.091701

Electro-osmotic Instability of Concentration Enrichment in Curved Geometries for an Aqueous Electrolyte

Authors: Bingrui Xu, Zhibo Gu, Wei Liu, Peng Huo, Yueting Zhou, S. M. Rubinstein, M. Z. Bazant, B. Zaltzman, I. Rubinstein, Daosheng Deng

Abstract: We report that an electro-osmotic instability of concentration enrichment in curved geometries for an aqueous electrolyte, as opposed to the well-known one, is initiated exclusively at the enriched interface (anode), rather than at the depleted one (cathode). For this instability, the limitation of unrealistically high material Peclet number in planar geometry is eliminated by the strong electric field arising from the line charge singularity. In a model setup of concentric circular electrodes, we show by stability analysis, numerical simulation, and experimental visualization that instability occurs at the inner anode, below a critical radius of curvature. The stability criterion is also formulated in terms of a critical electric field and extended to arbitrary (2d) geometries by conformal mapping. This discovery suggests that transport may be enhanced in processes limited by salt enrichment, such as reverse osmosis, by triggering this instability with needle-like electrodes.

Submitted 3 August, 2020; originally announced August 2020.

Comments: 5 pages, 4 figures

Journal ref: Phys. Rev. Fluids 5, 091701 (2020)

arXiv:2007.05975 [pdf, ps, other]

A Graph Symmetrisation Bound on Channel Information Leakage under Blowfish Privacy

Authors: Tobias Edwards, Benjamin I. P. Rubinstein, Zuhe Zhang, Sanming Zhou

Abstract: Blowfish privacy is a recent generalisation of differential privacy that enables improved utility while maintaining privacy policies with semantic guarantees, a factor that has driven the popularity of differential privacy in computer science. This paper relates Blowfish privacy to an important measure of privacy loss of information channels from the communications theory community: min-entropy leakage. Symmetry in an input data neighbouring relation is central to known connections between differential privacy and min-entropy leakage. But while differential privacy exhibits strong symmetry, Blowfish neighbouring relations correspond to arbitrary simple graphs owing to the framework's flexible privacy policies. To bound the min-entropy leakage of Blowfish-private mechanisms we organise our analysis over symmetrical partitions corresponding to orbits of graph automorphism groups. A construction meeting our bound with asymptotic equality demonstrates tightness.

Submitted 13 October, 2021; v1 submitted 12 July, 2020; originally announced July 2020.

Comments: 11 pages, 5 figures; accepted to IEEE Transactions on Information Theory

arXiv:2006.15417 [pdf, other]

Invertible Concept-based Explanations for CNN Models with Non-negative Concept Activation Vectors

Authors: Ruihan Zhang, Prashan Madumal, Tim Miller, Krista A. Ehinger, Benjamin I. P. Rubinstein

Abstract: Convolutional neural network (CNN) models for computer vision are powerful but lack explainability in their most basic form. This deficiency remains a key challenge when applying CNNs in important domains. Recent work on explanations through feature importance of approximate linear models has moved from input-level features (pixels or segments) to features from mid-layer feature maps in the form of concept activation vectors (CAVs). CAVs contain concept-level information and could be learned via clustering. In this work, we rethink the ACE algorithm of Ghorbani et al., proposing an alternative invertible concept-based explanation (ICE) framework to overcome its shortcomings. Based on the requirements of fidelity (approximate models to target models) and interpretability (being meaningful to people), we design measurements and evaluate a range of matrix factorization methods with our framework. We find that non-negative concept activation vectors (NCAVs) from non-negative matrix factorization provide superior performance in interpretability and fidelity based on computational and human subject experiments. Our framework provides both local and global concept-level explanations for pre-trained CNN models.

Submitted 17 June, 2021; v1 submitted 27 June, 2020; originally announced June 2020.

arXiv:2006.13120 [pdf, other]

Discrete Few-Shot Learning for Pan Privacy

Authors: Roei Gelbhart, Benjamin I. P. Rubinstein

Abstract: In this paper we present the first baseline results for the task of few-shot learning of discrete embedding vectors for image recognition. Few-shot learning is a highly researched task, commonly leveraged by recognition systems that are resource constrained to train on a small number of images per class. Few-shot systems typically store a continuous embedding vector of each class, posing a risk to privacy where system breaches or insider threats are a concern. Using discrete embedding vectors, we devise a simple cryptographic protocol, which uses one-way hash functions in order to build recognition systems that do not store their users' embedding vectors directly, thus providing the guarantee of computational pan privacy in a practical and wide-spread setting.

Submitted 23 June, 2020; originally announced June 2020.

arXiv:2006.06963 [pdf, other]

Needle in a Haystack: Label-Efficient Evaluation under Extreme Class Imbalance

Authors: Neil G. Marchant, Benjamin I. P. Rubinstein

Abstract: Important tasks like record linkage and extreme classification demonstrate extreme class imbalance, with 1 minority instance to every 1 million or more majority instances. Obtaining a sufficient sample of all classes, even just to achieve statistically-significant evaluation, is so challenging that most current approaches yield poor estimates or incur impractical cost. Where importance sampling has been levied against this challenge, restrictive constraints are placed on performance metrics, estimates do not come with appropriate guarantees, or evaluations cannot adapt to incoming labels. This paper develops a framework for online evaluation based on adaptive importance sampling. Given a target performance metric and model for $p(y|x)$, the framework adapts a distribution over items to label in order to maximize statistical precision. We establish strong consistency and a central limit theorem for the resulting performance estimates, and instantiate our framework with worked examples that leverage Dirichlet-tree models. Experiments demonstrate an average MSE superior to state-of-the-art on fixed label budgets.

Submitted 2 June, 2021; v1 submitted 12 June, 2020; originally announced June 2020.

Comments: 30 pages, 8 figures, updated to match version accepted for publication at KDD'21

ACM Class: H.3.4; I.5.2

arXiv:2005.13787 [pdf, other]

doi 10.1007/978-3-030-47436-2_12

Assessing Centrality Without Knowing Connections

Authors: Leyla Roohi, Benjamin I. P. Rubinstein, Vanessa Teague

Abstract: We consider the privacy-preserving computation of node influence in distributed social networks, as measured by egocentric betweenness centrality (EBC). Motivated by modern communication networks spanning multiple providers, we show for the first time how multiple mutually-distrusting parties can successfully compute node EBC while revealing only differentially-private information about their internal network connections. A theoretical utility analysis upper bounds a primary source of private EBC error---private release of ego networks---with high probability. Empirical results demonstrate practical applicability with a low 1.07 relative error achievable at strong privacy budget $ε=0.1$ on a Facebook graph, and insignificant performance degradation as the number of network provider parties grows.

Submitted 28 May, 2020; originally announced May 2020.

Comments: Full report of paper appearing in PAKDD2020

Journal ref: In: Advances in Knowledge Discovery and Data Mining. PAKDD 2020. Lecture Notes in Computer Science, vol 12085. Springer, Cham, pages 152-163 (2020)

arXiv:2002.06311 [pdf, other]

doi 10.1145/3324884.3416629

Legion: Best-First Concolic Testing

Authors: Dongge Liu, Gidon Ernst, Toby Murray, Benjamin I. P. Rubinstein

Abstract: Concolic execution and fuzzing are two complementary coverage-based testing techniques. How to achieve the best of both remains an open challenge. To address this research problem, we propose and evaluate Legion. Legion re-engineers the Monte Carlo tree search (MCTS) framework from the AI literature to treat automated test generation as a problem of sequential decision-making under uncertainty. Its best-first search strategy provides a principled way to learn the most promising program states to investigate at each search iteration, based on observed rewards from previous iterations. Legion incorporates a form of directed fuzzing that we call approximate path-preserving fuzzing (APPFuzzing) to investigate program states selected by MCTS. APPFuzzing serves as the Monte Carlo simulation technique and is implemented by extending prior work on constrained sampling. We evaluate Legion against competitors on 2531 benchmarks from the coverage category of Test-Comp 2020, as well as measuring its sensitivity to hyperparameters, demonstrating its effectiveness on a wide variety of input programs.

Submitted 22 September, 2020; v1 submitted 14 February, 2020; originally announced February 2020.

Comments: 12 pages, 2 Algorithms, 3 Figures, 2 Tables, ASE2020

arXiv:1911.10326 [pdf]

doi 10.1364/OE.383217

Deep learning reconstruction of ultrashort pulses from 2D spatial intensity patterns recorded by an all-in-line system in a single-shot

Authors: Ron Ziv, Alex Dikopoltsev, Tom Zahavy, Ittai Rubinstein, Pavel Sidorenko, Oren Cohen, Mordechai Segev

Abstract: We propose a simple all-in-line single-shot scheme for diagnostics of ultrashort laser pulses, consisting of a multi-mode fiber, a nonlinear crystal and a CCD camera. The system records a 2D spatial intensity pattern, from which the pulse shape (amplitude and phase) is recovered through a fast Deep Learning algorithm. We explore this scheme in simulations and demonstrate the recovery of ultrashort pulses, robustness to noise in measurements and to inaccuracies in the parameters of the system components. Our technique mitigates the need for commonly used iterative optimization reconstruction methods, which are usually slow and hampered by the presence of noise. These features make our concept system advantageous for real time probing of ultrafast processes and noisy conditions. Moreover, this work exemplifies that using deep learning we can unlock new types of systems for pulse recovery.

Submitted 23 November, 2019; originally announced November 2019.

arXiv:1909.06039 [pdf, other]

doi 10.1080/10618600.2020.1825451

d-blink: Distributed End-to-End Bayesian Entity Resolution

Authors: Neil G. Marchant, Andee Kaplan, Daniel N. Elazar, Benjamin I. P. Rubinstein, Rebecca C. Steorts

Abstract: Entity resolution (ER; also known as record linkage or de-duplication) is the process of merging noisy databases, often in the absence of unique identifiers. A major advancement in ER methodology has been the application of Bayesian generative models, which provide a natural framework for inferring latent entities with rigorous quantification of uncertainty. Despite these advantages, existing models are severely limited in practice, as standard inference algorithms scale quadratically in the number of records. While scaling can be managed by fitting the model on separate blocks of the data, such a naïve approach may induce significant error in the posterior. In this paper, we propose a principled model for scalable Bayesian ER, called "distributed Bayesian linkage" or d-blink, which jointly performs blocking and ER without compromising posterior correctness. Our approach relies on several key ideas, including: (i) an auxiliary variable representation that induces a partition of the entities and records into blocks; (ii) a method for constructing well-balanced blocks based on k-d trees; (iii) a distributed partially-collapsed Gibbs sampler with improved mixing; and (iv) fast algorithms for performing Gibbs updates. Empirical studies on six data sets---including a case study on the 2010 Decennial Census---demonstrate the scalability and effectiveness of our approach.

Submitted 22 September, 2020; v1 submitted 13 September, 2019; originally announced September 2019.

Comments: 32 pages, 6 figures, 5 tables. Includes 22 pages of supplementary material. This revision incorporates a case study on the 2010 U.S. Decennial Census

MSC Class: 62F15; 65C40; 68W15

arXiv:1908.05004 [pdf, other]

Stop the Open Data Bus, We Want to Get Off

Authors: Dr. Chris Culnane, A/Prof. Benjamin I. P. Rubinstein, A/Prof. Vanessa Teague

Abstract: The subject of this report is the re-identification of individuals in the Myki public transport dataset released as part of the Melbourne Datathon 2018. We demonstrate the ease with which we were able to re-identify ourselves, our co-travellers, and complete strangers; our analysis raises concerns about the nature and granularity of the data released, in particular the ability to identify vulnerable or sensitive groups.

Submitted 14 August, 2019; originally announced August 2019.

arXiv:1902.09062 [pdf, other]

Adversarial Reinforcement Learning under Partial Observability in Autonomous Computer Network Defence

Authors: Yi Han, David Hubczenko, Paul Montague, Olivier De Vel, Tamas Abraham, Benjamin I. P. Rubinstein, Christopher Leckie, Tansu Alpcan, Sarah Erfani

Abstract: Recent studies have demonstrated that reinforcement learning (RL) agents are susceptible to adversarial manipulation, similar to vulnerabilities previously demonstrated in the supervised learning setting. While most existing work studies the problem in the context of computer vision or console games, this paper focuses on reinforcement learning in autonomous cyber defence under partial observability. We demonstrate that under the black-box setting, where the attacker has no direct access to the target RL model, causative attacks---attacks that target the training process---can poison RL agents even if the attacker only has partial observability of the environment. In addition, we propose an inversion defence method that aims to apply the opposite perturbation to that which an attacker might use to generate their adversarial samples. Our experimental results illustrate that the countermeasure can effectively reduce the impact of the causative attack, while not significantly affecting the training process in non-attack scenarios.

Submitted 16 August, 2020; v1 submitted 24 February, 2019; originally announced February 2019.

Comments: 8 pages, 4 figures

arXiv:1902.08918 [pdf, other]

Truth Inference at Scale: A Bayesian Model for Adjudicating Highly Redundant Crowd Annotations

Authors: Yuan Li, Benjamin I. P. Rubinstein, Trevor Cohn

Abstract: Crowd-sourcing is a cheap and popular means of creating training and evaluation datasets for machine learning, however it poses the problem of `truth inference', as individual workers cannot be wholly trusted to provide reliable annotations. Research into models of annotation aggregation attempts to infer a latent `true' annotation, which has been shown to improve the utility of crowd-sourced data. However, existing techniques beat simple baselines only in low redundancy settings, where the number of annotations per instance is low ($\le 3$), or in situations where workers are unreliable and produce low quality annotations (e.g., through spamming, random, or adversarial behaviours.) As we show, datasets produced by crowd-sourcing are often not of this type: the data is highly redundantly annotated ($\ge 5$ annotations per instance), and the vast majority of workers produce high quality outputs. In these settings, the majority vote heuristic performs very well, and most truth inference models underperform this simple baseline. We propose a novel technique, based on a Bayesian graphical model with conjugate priors, and simple iterative expectation-maximisation inference. Our technique produces competitive performance to the state-of-the-art benchmark methods, and is the only method that significantly outperforms the majority vote heuristic at one-sided level 0.025, shown by significance tests. Moreover, our technique is simple, is implemented in only 50 lines of code, and trains in seconds.

Submitted 24 February, 2019; originally announced February 2019.

Comments: Accepted at the Web Conference/WWW 2019 (camera ready)

arXiv:1902.07500 [pdf, other]

A Note on Bounding Regret of the C$^2$UCB Contextual Combinatorial Bandit

Authors: Bastian Oetomo, Malinga Perera, Renata Borovica-Gajic, Benjamin I. P. Rubinstein

Abstract: We revisit the proof by Qin et al. (2014) of bounded regret of the C$^2$UCB contextual combinatorial bandit. We demonstrate an error in the proof of volumetric expansion of the moment matrix, used in upper bounding a function of context vector norms. We prove a relaxed inequality that yields the originally-stated regret bound.

Submitted 20 February, 2019; originally announced February 2019.

Comments: 3 pages

arXiv:1901.05562 [pdf, other]

Differentially-Private Two-Party Egocentric Betweenness Centrality

Authors: Leyla Roohi, Benjamin I. P. Rubinstein, Vanessa Teague

Abstract: We describe a novel protocol for computing the egocentric betweenness centrality of a node when relevant edge information is spread between two mutually distrusting parties such as two telecommunications providers. While each node belongs to one network or the other, its ego network might include edges unknown to its network provider. We develop a protocol of differentially-private mechanisms to hide each network's internal edge structure from the other; and contribute a new two-stage stratified sampler for exponential improvement to time and space efficiency. Empirical results on several open graph data sets demonstrate practical relative error rates while delivering strong privacy guarantees, such as 16% error on a Facebook data set.

Submitted 16 January, 2019; originally announced January 2019.

Comments: 10 pages; full report with proofs of paper accepted into INFOCOM'2019

arXiv:1808.05770 [pdf, other]

Reinforcement Learning for Autonomous Defence in Software-Defined Networking

Authors: Yi Han, Benjamin I. P. Rubinstein, Tamas Abraham, Tansu Alpcan, Olivier De Vel, Sarah Erfani, David Hubczenko, Christopher Leckie, Paul Montague

Abstract: Despite the successful application of machine learning (ML) in a wide range of domains, adaptability---the very property that makes machine learning desirable---can be exploited by adversaries to contaminate training and evade classification. In this paper, we investigate the feasibility of applying a specific class of machine learning algorithms, namely, reinforcement learning (RL) algorithms, for autonomous cyber defence in software-defined networking (SDN). In particular, we focus on how an RL agent reacts towards different forms of causative attacks that poison its training process, including indiscriminate and targeted, white-box and black-box attacks. In addition, we also study the impact of the attack timing, and explore potential countermeasures such as adversarial training.

Submitted 17 August, 2018; originally announced August 2018.

Comments: 20 pages, 8 figures

arXiv:1802.07975 [pdf, other]

Options for encoding names for data linking at the Australian Bureau of Statistics

Authors: Chris Culnane, Benjamin I. P. Rubinstein, Vanessa Teague

Abstract: Publicly, ABS has said it would use a cryptographic hash function to convert names collected in the 2016 Census of Population and Housing into an unrecognisable value in a way that is not reversible. In 2016, the ABS engaged the University of Melbourne to provide expert advice on cryptographic hash functions to meet this objective. For complex unit-record level data, including Census data, auxiliary data can often be used to link individual records, even without names. This is the basis of ABS's existing bronze linking. This means that records can probably be re-identified without the encoded name anyway. Protection against re-identification depends on good processes within ABS. The undertaking on the encoding of names should therefore be considered in the full context of auxiliary data and ABS processes. There are several reasonable interpretations: 1. That the encoding cannot be reversed except with a secret key held by ABS. This is the property achieved by encryption (Option 1), if properly implemented; 2. That the encoding, taken alone without auxiliary data, cannot be reversed to a single value. This is the property achieved by lossy encoding (Option 2), if properly implemented; 3. That the encoding doesn't make re-identification easier, or increase the number of records that can be re-identified, except with a secret key held by ABS. This is the property achieved by HMAC-based linkage key derivation using subsets of attributes (Option 3), if properly implemented.
We explain and compare the privacy and accuracy guarantees of five possible approaches. Options 4 and 5 investigate more sophisticated options for future data linking. We also explain how some commonly-advocated techniques can be reversed, and hence should not be used.

Submitted 22 February, 2018; originally announced February 2018.

Comments: University of Melbourne Research Contract 85449779. After receiving a draft of this report, ABS conducted a further assessment of Options 2 and 3, which will be published on their website

arXiv:1712.05627 [pdf]

Health Data in an Open World

Authors: Chris Culnane, Benjamin I. P. Rubinstein, Vanessa Teague

Abstract: With the aim of informing sound policy about data sharing and privacy, we describe successful re-identification of patients in an Australian de-identified open health dataset. As in prior studies of similar datasets, a few mundane facts often suffice to isolate an individual. Some people can be identified by name based on publicly available information. Decreasing the precision of the unit-record level data, or perturbing it statistically, makes re-identification gradually harder at a substantial cost to utility. We also examine the value of related datasets in improving the accuracy and confidence of re-identification. Our re-identifications were performed on a 10% sample dataset, but a related open Australian dataset allows us to infer with high confidence that some individuals in the sample have been correctly re-identified. Finally, we examine the combination of the open datasets with some commercial datasets that are known to exist but are not in our possession. We show that they would further increase the ease of re-identification.

Submitted 15 December, 2017; originally announced December 2017.

arXiv:1712.00871 [pdf, other]

Vulnerabilities in the use of similarity tables in combination with pseudonymisation to preserve data privacy in the UK Office for National Statistics' Privacy-Preserving Record Linkage

Authors: Chris Culnane, Benjamin I. P. Rubinstein, Vanessa Teague

Abstract: In the course of a survey of privacy-preserving record linkage, we reviewed the approach taken by the UK Office for National Statistics (ONS) as described in their series of reports "Beyond 2011". Our review identifies a number of matters of concern. Some of the issues discovered are sufficiently severe to present a risk to privacy.

Submitted 3 December, 2017; originally announced December 2017.

Showing 1–50 of 83 results for author: Rubinstein, I