-
A primer on synthetic health data
Authors:
Jennifer A Bartell,
Sander Boisen Valentin,
Anders Krogh,
Henning Langberg,
Martin Bøgsted
Abstract:
Recent advances in deep generative models have greatly expanded the potential to create realistic synthetic health datasets. These synthetic datasets aim to preserve the characteristics, patterns, and overall scientific conclusions derived from sensitive health datasets without disclosing patient identity or sensitive information. Thus, synthetic data can facilitate safe data sharing that supports…
▽ More
Recent advances in deep generative models have greatly expanded the potential to create realistic synthetic health datasets. These synthetic datasets aim to preserve the characteristics, patterns, and overall scientific conclusions derived from sensitive health datasets without disclosing patient identity or sensitive information. Thus, synthetic data can facilitate safe data sharing that supports a range of initiatives including the development of new predictive models, advanced health IT platforms, and general project ideation and hypothesis development. However, many questions and challenges remain, including how to consistently evaluate a synthetic dataset's similarity and predictive utility in comparison to the original real dataset and risk to privacy when shared. Additional regulatory and governance issues have not been widely addressed. In this primer, we map the state of synthetic health data, including generation and evaluation methods and tools, existing examples of deployment, the regulatory and ethical landscape, access and governance options, and opportunities for further development.
△ Less
Submitted 3 July, 2024; v1 submitted 31 January, 2024;
originally announced January 2024.
-
The Deep Generative Decoder: MAP estimation of representations improves modeling of single-cell RNA data
Authors:
Viktoria Schuster,
Anders Krogh
Abstract:
Learning low-dimensional representations of single-cell transcriptomics has become instrumental to its downstream analysis. The state of the art is currently represented by neural network models such as variational autoencoders (VAEs) which use a variational approximation of the likelihood for inference. We here present the Deep Generative Decoder (DGD), a simple generative model that computes mod…
▽ More
Learning low-dimensional representations of single-cell transcriptomics has become instrumental to its downstream analysis. The state of the art is currently represented by neural network models such as variational autoencoders (VAEs) which use a variational approximation of the likelihood for inference. We here present the Deep Generative Decoder (DGD), a simple generative model that computes model parameters and representations directly via maximum a posteriori (MAP) estimation. The DGD handles complex parameterized latent distributions naturally unlike VAEs which typically use a fixed Gaussian distribution, because of the complexity of adding other types. We first show its general functionality on a commonly used benchmark set, Fashion-MNIST. Secondly, we apply the model to multiple single-cell data sets. Here the DGD learns low-dimensional, meaningful and well-structured latent representations with sub-clustering beyond the provided labels. The advantages of this approach are its simplicity and its capability to provide representations of much smaller dimensionality than a comparable VAE.
△ Less
Submitted 12 July, 2023; v1 submitted 13 October, 2021;
originally announced October 2021.
-
A manifold learning perspective on representation learning: Learning decoder and representations without an encoder
Authors:
Viktoria Schuster,
Anders Krogh
Abstract:
Autoencoders are commonly used in representation learning. They consist of an encoder and a decoder, which provide a straightforward way to map n-dimensional data in input space to a lower m-dimensional representation space and back. The decoder itself defines an m-dimensional manifold in input space. Inspired by manifold learning, we show that the decoder can be trained on its own by learning the…
▽ More
Autoencoders are commonly used in representation learning. They consist of an encoder and a decoder, which provide a straightforward way to map n-dimensional data in input space to a lower m-dimensional representation space and back. The decoder itself defines an m-dimensional manifold in input space. Inspired by manifold learning, we show that the decoder can be trained on its own by learning the representations of the training samples along with the decoder weights using gradient descent. A sum-of-squares loss then corresponds to optimizing the manifold to have the smallest Euclidean distance to the training samples, and similarly for other loss functions. We derive expressions for the number of samples needed to specify the encoder and decoder and show that the decoder generally requires much less training samples to be well-specified compared to the encoder. We discuss training of autoencoders in this perspective and relate to previous work in the field that use noisy training examples and other types of regularization. On the natural image data sets MNIST and CIFAR10, we demonstrate that the decoder is much better suited to learn a low-dimensional representation, especially when trained on small data sets. Using simulated gene regulatory data, we further show that the decoder alone leads to better generalization and meaningful representations. Our approach of training the decoder alone facilitates representation learning even on small data sets and can lead to improved training of autoencoders. We hope that the simple analyses presented will also contribute to an improved conceptual understanding of representation learning.
△ Less
Submitted 13 October, 2021; v1 submitted 31 August, 2021;
originally announced August 2021.