-
SynDiffix: More accurate synthetic structured data
Authors:
Paul Francis,
Cristian Berneanu,
Edon Gashi
Abstract:
This paper introduces SynDiffix, a mechanism for generating statistically accurate, anonymous synthetic data for structured data. Recent open source and commercial systems use Generative Adversarial Networks or Transformed Auto Encoders to synthesize data, and achieve anonymity through overfitting-avoidance. By contrast, SynDiffix exploits traditional mechanisms of aggregation, noise addition, and suppression among others. Compared to CTGAN, ML models generated from SynDiffix are twice as accurate, marginal and column pairs data quality is one to two orders of magnitude more accurate, and execution time is two orders of magnitude faster. Compared to the best commercial product we measured (MostlyAI), ML model accuracy is comparable, marginal and pairs accuracy is 5 to 10 times better, and execution time is an order of magnitude faster. Similar to the other approaches, SynDiffix anonymization is very strong. This paper describes SynDiffix and compares its performance with other popular open source and commercial systems.
Submitted 16 November, 2023;
originally announced November 2023.
-
Diffix Elm: Simple Diffix
Authors:
Paul Francis,
Sebastian Probst-Eide,
David Wagner,
Felix Bauer,
Cristian Berneanu,
Edon Gashi
Abstract:
Historically, strong data anonymization requires substantial domain expertise and custom design for the given data set and use case. Diffix is an anonymization framework designed to make strong data anonymization available to non-experts. This paper describes Diffix Elm, a version of Diffix that is very easy to use at the expense of query features. We describe Diffix Elm, and show that it provides strong anonymity based on the General Data Protection Regulation (GDPR) criteria.
This document is the third version of Diffix Elm. The second version added ceiling, round, and bucket_width functions (in addition to floor). This document adds the ability to protect multiple different kinds of protected entities (a feature not found in earlier versions of Diffix). It also adds counting distinct values for any column (rather than only the AID column).
Submitted 20 June, 2022; v1 submitted 12 January, 2022;
originally announced January 2022.
-
Diffix-Birch: Extending Diffix-Aspen
Authors:
Paul Francis,
Sebastian Probst-Eide,
Pawel Obrok,
Cristian Berneanu,
Sasa Juric,
Reinhard Munz
Abstract:
A longstanding open problem is that of how to get high quality statistics through direct queries to databases containing information about individuals without revealing information specific to those individuals. Diffix is a framework for anonymous database query that adds noise based on the filter conditions in the query. A previous paper described the first version, called diffix-aspen. This version, diffix-birch, extends that description to include a wide variety of common features found in SQL. It describes attacks associated with various features, and the anonymization steps used to defend against those attacks. This paper describes diffix-birch, which was used for the bounty program sponsored by Aircloak starting December 2017.
Submitted 21 August, 2019; v1 submitted 6 June, 2018;
originally announced June 2018.