Faking feature importance: A cautionary tale on the use of differentially-private synthetic data
Authors:
Oscar Giles,
Kasra Hosseini,
Grigorios Mingas,
Oliver Strickson,
Louise Bowler,
Camila Rangel Smith,
Harrison Wilde,
Jen Ning Lim,
Bilal Mateen,
Kasun Amarasinghe,
Rayid Ghani,
Alison Heppenstall,
Nik Lomax,
Nick Malleson,
Martin O'Reilly,
Sebastian Vollmer
Abstract:
Synthetic datasets are often presented as a silver-bullet solution to the problem of privacy-preserving data publishing. However, for many applications, synthetic data has been shown to have limited utility when used to train predictive models. One promising potential application of these data is in the exploratory phase of the machine learning workflow, which involves understanding, engineering and selecting features. This phase often involves considerable time, and depends on the availability of data. There would be substantial value in synthetic data that permitted these steps to be carried out while, for example, data access was being negotiated, or with fewer information governance restrictions. This paper presents an empirical analysis of the agreement between the feature importance obtained from raw and from synthetic data, on a range of artificially generated and real-world datasets (where feature importance represents how useful each feature is when predicting the outcome). We employ two differentially-private methods to produce synthetic data, and apply various utility measures to quantify the agreement in feature importance as this varies with the level of privacy. Our results indicate that synthetic data can sometimes preserve several representations of the ranking of feature importance in simple settings, but its performance is not consistent and depends upon a number of factors. Particular caution should be exercised in more nuanced real-world settings, where synthetic data can lead to differences in ranked feature importance that could alter key modelling decisions. This work has important implications for developing synthetic versions of highly sensitive datasets in fields such as finance and healthcare.
Submitted 2 March, 2022;
originally announced March 2022.
Fast Distance Fields for Fluid Dynamics Mesh Generation on Graphics Hardware
Authors:
A. Roosing,
O. T. Strickson,
N. Nikiforakis
Abstract:
We present a CUDA-accelerated implementation of the Characteristic/Scan Conversion algorithm to generate narrow-band signed distance fields in logically Cartesian grids. We outline an approach to task and data management on GPUs, based on an input of a closed triangulated surface, with the aim of reducing pre-processing and mesh-generation times. The work demonstrates fast signed distance field generation for triangulated surfaces with tens of thousands to several million features in high-resolution domains. We present improvements to the robustness of the original algorithm and an overview of handling geometric data.
Submitted 1 March, 2019;
originally announced March 2019.