Search | arXiv e-print repository

MarkerMap: nonlinear marker selection for single-cell studies

Authors: Nabeel Sarwar, Wilson Gregory, George A Kevrekidis, Soledad Villar, Bianca Dumitrascu

Abstract: Single-cell RNA-seq data allow the quantification of cell type differences across a growing set of biological contexts. However, pinpointing a small subset of genomic features explaining this variability can be ill-defined and computationally intractable. Here we introduce MarkerMap, a generative model for selecting minimal gene sets which are maximally informative of cell type origin and enable w… ▽ More Single-cell RNA-seq data allow the quantification of cell type differences across a growing set of biological contexts. However, pinpointing a small subset of genomic features explaining this variability can be ill-defined and computationally intractable. Here we introduce MarkerMap, a generative model for selecting minimal gene sets which are maximally informative of cell type origin and enable whole transcriptome reconstruction. MarkerMap provides a scalable framework for both supervised marker selection, aimed at identifying specific cell type populations, and unsupervised marker selection, aimed at gene expression imputation and reconstruction. We benchmark MarkerMap's competitive performance against previously published approaches on real single cell gene expression data sets. MarkerMap is available as a pip installable package, as a community resource aimed at develo** explainable machine learning techniques for enhancing interpretability in single-cell studies. △ Less

Submitted 28 July, 2022; originally announced July 2022.

arXiv:2204.00887 [pdf, other]

Dimensionless machine learning: Imposing exact units equivariance

Authors: Soledad Villar, Weichi Yao, David W. Hogg, Ben Blum-Smith, Bianca Dumitrascu

Abstract: Units equivariance (or units covariance) is the exact symmetry that follows from the requirement that relationships among measured quantities of physics relevance must obey self-consistent dimensional scalings. Here, we express this symmetry in terms of a (non-compact) group action, and we employ dimensional analysis and ideas from equivariant machine learning to provide a methodology for exactly… ▽ More Units equivariance (or units covariance) is the exact symmetry that follows from the requirement that relationships among measured quantities of physics relevance must obey self-consistent dimensional scalings. Here, we express this symmetry in terms of a (non-compact) group action, and we employ dimensional analysis and ideas from equivariant machine learning to provide a methodology for exactly units-equivariant machine learning: For any given learning task, we first construct a dimensionless version of its inputs using classic results from dimensional analysis, and then perform inference in the dimensionless space. Our approach can be used to impose units equivariance across a broad range of machine learning methods which are equivariant to rotations and other groups. We discuss the in-sample and out-of-sample prediction accuracy gains one can obtain in contexts like symbolic regression and emulation, where symmetry is important. We illustrate our approach with simple numerical examples involving dynamical systems in physics and ecology. △ Less

Submitted 31 December, 2022; v1 submitted 2 April, 2022; originally announced April 2022.

Journal ref: Journal of Machine Learning Research 24 (2023) 1--32

arXiv:1910.05355 [pdf, other]

Nonparametric Bayesian multi-armed bandits for single cell experiment design

Authors: Federico Camerlenghi, Bianca Dumitrascu, Federico Ferrari, Barbara E. Engelhardt, Stefano Favaro

Abstract: The problem of maximizing cell type discovery under budget constraints is a fundamental challenge for the collection and analysis of single-cell RNA-sequencing (scRNA-seq) data. In this paper, we introduce a simple, computationally efficient, and scalable Bayesian nonparametric sequential approach to optimize the budget allocation when designing a large scale experiment for the collection of scRNA… ▽ More The problem of maximizing cell type discovery under budget constraints is a fundamental challenge for the collection and analysis of single-cell RNA-sequencing (scRNA-seq) data. In this paper, we introduce a simple, computationally efficient, and scalable Bayesian nonparametric sequential approach to optimize the budget allocation when designing a large scale experiment for the collection of scRNA-seq data for the purpose of, but not limited to, creating cell atlases. Our approach relies on the following tools: i) a hierarchical Pitman-Yor prior that recapitulates biological assumptions regarding cellular differentiation, and ii) a Thompson sampling multi-armed bandit strategy that balances exploitation and exploration to prioritize experiments across a sequence of trials. Posterior inference is performed by using a sequential Monte Carlo approach, which allows us to fully exploit the sequential nature of our species sampling problem. We empirically show that our approach outperforms state-of-the-art methods and achieves near-Oracle performance on simulated and scRNA-seq data alike. HPY-TS code is available at https://github.com/fedfer/HPYsinglecell. △ Less

Submitted 20 September, 2020; v1 submitted 11 October, 2019; originally announced October 2019.

arXiv:1906.00226 [pdf, ps, other]

Patient-Specific Effects of Medication Using Latent Force Models with Gaussian Processes

Authors: Li-Fang Cheng, Bianca Dumitrascu, Michael Zhang, Corey Chivers, Michael Draugelis, Kai Li, Barbara E. Engelhardt

Abstract: Multi-output Gaussian processes (GPs) are a flexible Bayesian nonparametric framework that has proven useful in jointly modeling the physiological states of patients in medical time series data. However, capturing the short-term effects of drugs and therapeutic interventions on patient physiological state remains challenging. We propose a novel approach that models the effect of interventions as a… ▽ More Multi-output Gaussian processes (GPs) are a flexible Bayesian nonparametric framework that has proven useful in jointly modeling the physiological states of patients in medical time series data. However, capturing the short-term effects of drugs and therapeutic interventions on patient physiological state remains challenging. We propose a novel approach that models the effect of interventions as a hybrid Gaussian process composed of a GP capturing patient physiology convolved with a latent force model capturing effects of treatments on specific physiological features. This convolution of a multi-output GP with a GP including a causal time-marked kernel leads to a well-characterized model of the patients' physiological state responding to interventions. We show that our model leads to analytically tractable cross-covariance functions, allowing scalable inference. Our hierarchical model includes estimates of patient-specific effects but allows sharing of support across patients. Our approach achieves competitive predictive performance on challenging hospital data, where we recover patient-specific response to the administration of three common drugs: one antihypertensive drug and two anticoagulants. △ Less

Submitted 1 June, 2019; originally announced June 2019.

arXiv:1905.10003 [pdf, other]

doi 10.1109/TSP.2023.3267992

Sequential Gaussian Processes for Online Learning of Nonstationary Functions

Authors: Michael Minyi Zhang, Bianca Dumitrascu, Sinead A. Williamson, Barbara E. Engelhardt

Abstract: Many machine learning problems can be framed in the context of estimating functions, and often these are time-dependent functions that are estimated in real-time as observations arrive. Gaussian processes (GPs) are an attractive choice for modeling real-valued nonlinear functions due to their flexibility and uncertainty quantification. However, the typical GP regression model suffers from several… ▽ More Many machine learning problems can be framed in the context of estimating functions, and often these are time-dependent functions that are estimated in real-time as observations arrive. Gaussian processes (GPs) are an attractive choice for modeling real-valued nonlinear functions due to their flexibility and uncertainty quantification. However, the typical GP regression model suffers from several drawbacks: 1) Conventional GP inference scales $O(N^{3})$ with respect to the number of observations; 2) Updating a GP model sequentially is not trivial; and 3) Covariance kernels typically enforce stationarity constraints on the function, while GPs with non-stationary covariance kernels are often intractable to use in practice. To overcome these issues, we propose a sequential Monte Carlo algorithm to fit infinite mixtures of GPs that capture non-stationary behavior while allowing for online, distributed inference. Our approach empirically improves performance over state-of-the-art methods for online GP estimation in the presence of non-stationarity in time-series data. To demonstrate the utility of our proposed online Gaussian process mixture-of-experts approach in applied settings, we show that we can sucessfully implement an optimization algorithm using online Gaussian process bandits. △ Less

Submitted 6 May, 2023; v1 submitted 23 May, 2019; originally announced May 2019.

Journal ref: IEEE Transactions on Signal Processing, vol. 71, pp. 1539-1550, 2023

arXiv:1805.07458 [pdf, other]

PG-TS: Improved Thompson Sampling for Logistic Contextual Bandits

Authors: Bianca Dumitrascu, Karen Feng, Barbara E Engelhardt

Abstract: We address the problem of regret minimization in logistic contextual bandits, where a learner decides among sequential actions or arms given their respective contexts to maximize binary rewards. Using a fast inference procedure with Polya-Gamma distributed augmentation variables, we propose an improved version of Thompson Sampling, a Bayesian formulation of contextual bandits with near-optimal per… ▽ More We address the problem of regret minimization in logistic contextual bandits, where a learner decides among sequential actions or arms given their respective contexts to maximize binary rewards. Using a fast inference procedure with Polya-Gamma distributed augmentation variables, we propose an improved version of Thompson Sampling, a Bayesian formulation of contextual bandits with near-optimal performance. Our approach, Polya-Gamma augmented Thompson Sampling (PG-TS), achieves state-of-the-art performance on simulated and real data. PG-TS explores the action space efficiently and exploits high-reward arms, quickly converging to solutions of low regret. Its explicit estimation of the posterior distribution of the context feature covariance leads to substantial empirical gains over approximate approaches. PG-TS is the first approach to demonstrate the benefits of Polya-Gamma augmentation in bandits and to propose an efficient Gibbs sampler for approximating the analytically unsolvable integral of logistic contextual bandits. △ Less

Submitted 18 May, 2018; originally announced May 2018.

arXiv:1703.09112 [pdf, other]

Sparse Multi-Output Gaussian Processes for Medical Time Series Prediction

Authors: Li-Fang Cheng, Gregory Darnell, Bianca Dumitrascu, Corey Chivers, Michael E Draugelis, Kai Li, Barbara E Engelhardt

Abstract: In the scenario of real-time monitoring of hospital patients, high-quality inference of patients' health status using all information available from clinical covariates and lab tests is essential to enable successful medical interventions and improve patient outcomes. Develo** a computational framework that can learn from observational large-scale electronic health records (EHRs) and make accura… ▽ More In the scenario of real-time monitoring of hospital patients, high-quality inference of patients' health status using all information available from clinical covariates and lab tests is essential to enable successful medical interventions and improve patient outcomes. Develo** a computational framework that can learn from observational large-scale electronic health records (EHRs) and make accurate real-time predictions is a critical step. In this work, we develop and explore a Bayesian nonparametric model based on Gaussian process (GP) regression for hospital patient monitoring. We propose MedGP, a statistical framework that incorporates 24 clinical and lab covariates and supports a rich reference data set from which relationships between observed covariates may be inferred and exploited for high-quality inference of patient state over time. To do this, we develop a highly structured sparse GP kernel to enable tractable computation over tens of thousands of time points while estimating correlations among clinical covariates, patients, and periodicity in patient observations. MedGP has a number of benefits over current methods, including (i) not requiring an alignment of the time series data, (ii) quantifying confidence regions in the predictions, (iii) exploiting a vast and rich database of patients, and (iv) inferring interpretable relationships among clinical covariates. We evaluate and compare results from MedGP on the task of online prediction for three patient subgroups from two medical data sets across 8,043 patients. We found MedGP improves online prediction over baseline methods for nearly all covariates across different disease subgroups and studies. The publicly available code is at https://github.com/bee-hive/MedGP. △ Less

Submitted 21 June, 2018; v1 submitted 27 March, 2017; originally announced March 2017.

Comments: Add new results and appendices

arXiv:1512.01616 [pdf, other]

A Bayesian test to identify variance effects

Authors: Bianca Dumitrascu, Gregory Darnell, Julien Ayroles, Barbara E Engelhardt

Abstract: Identifying genetic variants that regulate quantitative traits, or QTLs, is the primary focus of the field of statistical genetics. Most current methods are limited to identifying mean effects, or associations between genotype and the mean value of a quantitative trait. It is possible, however, that a genetic variant may affect the variance of the quantitative trait in lieu of, or in addition to,… ▽ More Identifying genetic variants that regulate quantitative traits, or QTLs, is the primary focus of the field of statistical genetics. Most current methods are limited to identifying mean effects, or associations between genotype and the mean value of a quantitative trait. It is possible, however, that a genetic variant may affect the variance of the quantitative trait in lieu of, or in addition to, affecting the trait mean. Here, we develop a general methodological approach to identifying covariates with variance effects on a quantitative trait using a Bayesian heteroskedastic linear regression model. We show that our Bayesian test for heteroskedasticity (BTH) outperforms classical tests for differences in variation across a large range of simulations drawn from scenarios common to the analysis of quantitative traits. We apply BTH to methylation QTL study data and expression QTL study data to identify variance QTLs. When compared with three tests for heteroskedasticity used in genomics, we illustrate the benefits of using our approach, including avoiding overfitting by incorporating uncertainty and flexibly identifying heteroskedastic effects. △ Less

Submitted 4 December, 2015; originally announced December 2015.

Showing 1–8 of 8 results for author: Dumitrascu, B