-
SurvivalGAN: Generating Time-to-Event Data for Survival Analysis
Authors:
Alexander Norcliffe,
Bogdan Cebere,
Fergus Imrie,
Pietro Lio,
Mihaela van der Schaar
Abstract:
Synthetic data is becoming an increasingly promising technology, and successful applications can improve privacy, fairness, and data democratization. While there are many methods for generating synthetic tabular data, the task remains non-trivial and unexplored for specific scenarios. One such scenario is survival data. Here, the key difficulty is censoring: for some instances, we are not aware of…
▽ More
Synthetic data is becoming an increasingly promising technology, and successful applications can improve privacy, fairness, and data democratization. While there are many methods for generating synthetic tabular data, the task remains non-trivial and unexplored for specific scenarios. One such scenario is survival data. Here, the key difficulty is censoring: for some instances, we are not aware of the time of event, or if one even occurred. Imbalances in censoring and time horizons cause generative models to experience three new failure modes specific to survival analysis: (1) generating too few at-risk members; (2) generating too many at-risk members; and (3) censoring too early. We formalize these failure modes and provide three new generative metrics to quantify them. Following this, we propose SurvivalGAN, a generative model that handles survival data firstly by addressing the imbalance in the censoring and event horizons, and secondly by using a dedicated mechanism for approximating time-to-event/censoring. We evaluate this method via extensive experiments on medical datasets. SurvivalGAN outperforms multiple baselines at generating survival data, and in particular addresses the failure modes as measured by the new metrics, in addition to improving downstream performance of survival models trained on the synthetic data.
△ Less
Submitted 24 February, 2023;
originally announced February 2023.
-
Synthcity: facilitating innovative use cases of synthetic data in different data modalities
Authors:
Zhaozhi Qian,
Bogdan-Constantin Cebere,
Mihaela van der Schaar
Abstract:
Synthcity is an open-source software package for innovative use cases of synthetic data in ML fairness, privacy and augmentation across diverse tabular data modalities, including static data, regular and irregular time series, data with censoring, multi-source data, composite data, and more. Synthcity provides the practitioners with a single access point to cutting edge research and tools in synth…
▽ More
Synthcity is an open-source software package for innovative use cases of synthetic data in ML fairness, privacy and augmentation across diverse tabular data modalities, including static data, regular and irregular time series, data with censoring, multi-source data, composite data, and more. Synthcity provides the practitioners with a single access point to cutting edge research and tools in synthetic data. It also offers the community a playground for rapid experimentation and prototy**, a one-stop-shop for SOTA benchmarks, and an opportunity for extending research impact. The library can be accessed on GitHub (https://github.com/vanderschaarlab/synthcity) and pip (https://pypi.org/project/synthcity/). We warmly invite the community to join the development effort by providing feedback, reporting bugs, and contributing code.
△ Less
Submitted 18 January, 2023;
originally announced January 2023.
-
AutoPrognosis 2.0: Democratizing Diagnostic and Prognostic Modeling in Healthcare with Automated Machine Learning
Authors:
Fergus Imrie,
Bogdan Cebere,
Eoin F. McKinney,
Mihaela van der Schaar
Abstract:
Diagnostic and prognostic models are increasingly important in medicine and inform many clinical decisions. Recently, machine learning approaches have shown improvement over conventional modeling techniques by better capturing complex interactions between patient covariates in a data-driven manner. However, the use of machine learning introduces a number of technical and practical challenges that…
▽ More
Diagnostic and prognostic models are increasingly important in medicine and inform many clinical decisions. Recently, machine learning approaches have shown improvement over conventional modeling techniques by better capturing complex interactions between patient covariates in a data-driven manner. However, the use of machine learning introduces a number of technical and practical challenges that have thus far restricted widespread adoption of such techniques in clinical settings. To address these challenges and empower healthcare professionals, we present a machine learning framework, AutoPrognosis 2.0, to develop diagnostic and prognostic models. AutoPrognosis leverages state-of-the-art advances in automated machine learning to develop optimized machine learning pipelines, incorporates model explainability tools, and enables deployment of clinical demonstrators, without requiring significant technical expertise. Our framework eliminates the major technical obstacles to predictive modeling with machine learning that currently impede clinical adoption. To demonstrate AutoPrognosis 2.0, we provide an illustrative application where we construct a prognostic risk score for diabetes using the UK Biobank, a prospective study of 502,467 individuals. The models produced by our automated framework achieve greater discrimination for diabetes than expert clinical risk scores. Our risk score has been implemented as a web-based decision support tool and can be publicly accessed by patients and clinicians worldwide. In addition, AutoPrognosis 2.0 is provided as an open-source python package. By open-sourcing our framework as a tool for the community, clinicians and other medical practitioners will be able to readily develop new risk scores, personalized diagnostics, and prognostics using modern machine learning techniques.
△ Less
Submitted 21 October, 2022;
originally announced October 2022.
-
HyperImpute: Generalized Iterative Imputation with Automatic Model Selection
Authors:
Daniel Jarrett,
Bogdan Cebere,
Tennison Liu,
Alicia Curth,
Mihaela van der Schaar
Abstract:
Consider the problem of imputing missing values in a dataset. One the one hand, conventional approaches using iterative imputation benefit from the simplicity and customizability of learning conditional distributions directly, but suffer from the practical requirement for appropriate model specification of each and every variable. On the other hand, recent methods using deep generative modeling be…
▽ More
Consider the problem of imputing missing values in a dataset. One the one hand, conventional approaches using iterative imputation benefit from the simplicity and customizability of learning conditional distributions directly, but suffer from the practical requirement for appropriate model specification of each and every variable. On the other hand, recent methods using deep generative modeling benefit from the capacity and efficiency of learning with neural network function approximators, but are often difficult to optimize and rely on stronger data assumptions. In this work, we study an approach that marries the advantages of both: We propose *HyperImpute*, a generalized iterative imputation framework for adaptively and automatically configuring column-wise models and their hyperparameters. Practically, we provide a concrete implementation with out-of-the-box learners, optimizers, simulators, and extensible interfaces. Empirically, we investigate this framework via comprehensive experiments and sensitivities on a variety of public datasets, and demonstrate its ability to generate accurate imputations relative to a strong suite of benchmarks. Contrary to recent work, we believe our findings constitute a strong defense of the iterative imputation paradigm.
△ Less
Submitted 15 June, 2022;
originally announced June 2022.
-
Syft 0.5: A Platform for Universally Deployable Structured Transparency
Authors:
Adam James Hall,
Madhava Jay,
Tudor Cebere,
Bogdan Cebere,
Koen Lennart van der Veen,
George Muraru,
Tongye Xu,
Patrick Cason,
William Abramson,
Ayoub Benaissa,
Chinmay Shah,
Alan Aboudib,
Théo Ryffel,
Kritika Prakash,
Tom Titcombe,
Varun Kumar Khare,
Maddie Shang,
Ionesio Junior,
Animesh Gupta,
Jason Paumier,
Nahua Kang,
Vova Manannikov,
Andrew Trask
Abstract:
We present Syft 0.5, a general-purpose framework that combines a core group of privacy-enhancing technologies that facilitate a universal set of structured transparency systems. This framework is demonstrated through the design and implementation of a novel privacy-preserving inference information flow where we pass homomorphically encrypted activation signals through a split neural network for in…
▽ More
We present Syft 0.5, a general-purpose framework that combines a core group of privacy-enhancing technologies that facilitate a universal set of structured transparency systems. This framework is demonstrated through the design and implementation of a novel privacy-preserving inference information flow where we pass homomorphically encrypted activation signals through a split neural network for inference. We show that splitting the model further up the computation chain significantly reduces the computation time of inference and the payload size of activation signals at the cost of model secrecy. We evaluate our proposed flow with respect to its provision of the core structural transparency principles.
△ Less
Submitted 27 April, 2021; v1 submitted 26 April, 2021;
originally announced April 2021.
-
TenSEAL: A Library for Encrypted Tensor Operations Using Homomorphic Encryption
Authors:
Ayoub Benaissa,
Bilal Retiat,
Bogdan Cebere,
Alaa Eddine Belfedhal
Abstract:
Machine learning algorithms have achieved remarkable results and are widely applied in a variety of domains. These algorithms often rely on sensitive and private data such as medical and financial records. Therefore, it is vital to draw further attention regarding privacy threats and corresponding defensive techniques applied to machine learning models. In this paper, we present TenSEAL, an open-s…
▽ More
Machine learning algorithms have achieved remarkable results and are widely applied in a variety of domains. These algorithms often rely on sensitive and private data such as medical and financial records. Therefore, it is vital to draw further attention regarding privacy threats and corresponding defensive techniques applied to machine learning models. In this paper, we present TenSEAL, an open-source library for Privacy-Preserving Machine Learning using Homomorphic Encryption that can be easily integrated within popular machine learning frameworks. We benchmark our implementation using MNIST and show that an encrypted convolutional neural network can be evaluated in less than a second, using less than half a megabyte of communication.
△ Less
Submitted 28 April, 2021; v1 submitted 7 April, 2021;
originally announced April 2021.
-
Asymmetric Private Set Intersection with Applications to Contact Tracing and Private Vertical Federated Machine Learning
Authors:
Nick Angelou,
Ayoub Benaissa,
Bogdan Cebere,
William Clark,
Adam James Hall,
Michael A. Hoeh,
Daniel Liu,
Pavlos Papadopoulos,
Robin Roehm,
Robert Sandmann,
Phillipp Schoppmann,
Tom Titcombe
Abstract:
We present a multi-language, cross-platform, open-source library for asymmetric private set intersection (PSI) and PSI-Cardinality (PSI-C). Our protocol combines traditional DDH-based PSI and PSI-C protocols with compression based on Bloom filters that helps reduce communication in the asymmetric setting. Currently, our library supports C++, C, Go, WebAssembly, JavaScript, Python, and Rust, and ru…
▽ More
We present a multi-language, cross-platform, open-source library for asymmetric private set intersection (PSI) and PSI-Cardinality (PSI-C). Our protocol combines traditional DDH-based PSI and PSI-C protocols with compression based on Bloom filters that helps reduce communication in the asymmetric setting. Currently, our library supports C++, C, Go, WebAssembly, JavaScript, Python, and Rust, and runs on both traditional hardware (x86) and browser targets. We further apply our library to two use cases: (i) a privacy-preserving contact tracing protocol that is compatible with existing approaches, but improves their privacy guarantees, and (ii) privacy-preserving machine learning on vertically partitioned data.
△ Less
Submitted 18 November, 2020;
originally announced November 2020.