Improving Correlation Capture in Generating Imbalanced Data using Differentially Private Conditional GANs

Authors: Chang Sun, Johan van Soest, Michel Dumontier

Abstract: Despite the remarkable success of Generative Adversarial Networks (GANs) on text, images, and videos, generating high-quality tabular data is still under development owing to unique challenges such as capturing dependencies in imbalanced data and optimizing the quality of synthetic patient data while preserving privacy. In this paper, we propose DP-CGANS, a differentially private conditional GAN framework consisting of data transformation, sampling, conditioning, and network training to generate realistic and privacy-preserving tabular data. DP-CGANS distinguishes categorical and continuous variables and transforms them to latent space separately. Then, we structure a conditional vector as an additional input that not only represents the minority class in the imbalanced data, but also captures the dependency between variables. We inject statistical noise into the gradients during the network training process of DP-CGANS to provide a differential privacy guarantee. We extensively evaluate our model against state-of-the-art generative models on three public datasets and two real-world personal health datasets in terms of statistical similarity, machine learning performance, and privacy measurement. We demonstrate that our model outperforms other comparable models, especially in capturing dependency between variables. Finally, we present the balance between data utility and privacy in synthetic data generation, considering the different data structures and characteristics of real-world datasets such as imbalanced variables, abnormal distributions, and sparsity of data.

Submitted 28 June, 2022; originally announced June 2022.

ACM Class: I.2; E.0

arXiv:2107.02482

A Knowledge graph representation of baseline characteristics for the Dutch proton therapy research registry

Authors: Matthijs Sloep, Petros Kalendralis, Ananya Choudhury, Lerau Seyben, Jasper Snel, Nibin Moni George, Martijn Veening, Johannes A. Langendijk, Andre Dekker, Johan van Soest, Rianne Fijten

Abstract: Cancer registries collect multisource data and provide valuable information that can lead to unique research opportunities. In the Netherlands, a registry and model-based approach (MBA) are used for the selection of patients that are eligible for proton therapy. We collected baseline characteristics including demographic, clinical, tumour and treatment information. These data were transformed into a machine readable format using the FAIR (Findable, Accessible, Interoperable, Reusable) data principles and resulted in a knowledge graph with baseline characteristics of proton therapy patients. With this approach, we enable the possibility of linking external data sources and optimal flexibility to easily adapt the data structure of the existing knowledge graph to the needs of the clinic.

Submitted 6 July, 2021; originally announced July 2021.

arXiv:1812.00991

Analyzing Partitioned FAIR Health Data Responsibly

Authors: Chang Sun, Lianne Ippel, Birgit Wouters, Johan van Soest, Alexander Malic, Onaopepo Adekunle, Bob van den Berg, Marco Puts, Ole Mussmann, Annemarie Koster, Carla van der Kallen, David Townend, Andre Dekker, Michel Dumontier

Abstract: It is widely anticipated that the use of health-related big data will enable further understanding and improvements in human health and wellbeing. Our current project, funded through the Dutch National Research Agenda, aims to explore the relationship between the development of diabetes and socio-economic factors such as lifestyle and health care utilization. The analysis involves combining data from the Maastricht Study (DMS), a prospective clinical study, and data collected by Statistics Netherlands (CBS) as part of its routine operations. However, a wide array of social, legal, technical, and scientific issues hinder the analysis. In this paper, we describe these challenges and our progress towards addressing them.

Submitted 2 December, 2018; originally announced December 2018.

Comments: 6 pages, 1 figure, preliminary result, project report

ACM Class: E.1; E.3; H.2.4; H.2.8

Showing 1–3 of 3 results for author: van Soest, J