-
Answering open questions in biology using spatial genomics and structured methods
Authors:
Siddhartha G Jena,
Archit Verma,
Barbara E Engelhardt
Abstract:
Genomics methods have uncovered patterns in a range of biological systems, but obscure important aspects of cell behavior: the shape, relative locations of, movement of, and interactions between cells in space. Spatial technologies that collect genomic or epigenomic data while preserving spatial information have begun to overcome these limitations. These new data promise a deeper understanding of the factors that affect cellular behavior, and in particular the ability to directly test existing theories about cell state and variation in the context of morphology, location, motility, and signaling that could not be tested before. Rapid advancements in resolution, ease-of-use, and scale of spatial genomics technologies to address these questions also require an updated toolkit of statistical methods with which to interrogate these data. We present four open biological questions that can now be answered using spatial genomics data paired with methods for analysis. We outline spatial data modalities for each that may yield specific insight, discuss how conflicting theories may be tested by comparing the data to conceptual models of biological behavior, and highlight statistical and machine learning-based tools that may prove particularly helpful to recover biological insight.
Submitted 14 October, 2023;
originally announced October 2023.
-
SynthA1c: Towards Clinically Interpretable Patient Representations for Diabetes Risk Stratification
Authors:
Michael S. Yao,
Allison Chae,
Matthew T. MacLean,
Anurag Verma,
Jeffrey Duda,
James Gee,
Drew A. Torigian,
Daniel Rader,
Charles Kahn,
Walter R. Witschey,
Hersh Sagreiya
Abstract:
Early diagnosis of Type 2 Diabetes Mellitus (T2DM) is crucial to enable timely therapeutic interventions and lifestyle modifications. As the time available for clinical office visits shortens and medical imaging data become more widely available, patient image data could be used to opportunistically identify patients for additional T2DM diagnostic workup by physicians. We investigated whether image-derived phenotypic data could be leveraged in tabular learning classifier models to predict T2DM risk in an automated fashion to flag high-risk patients without the need for additional blood laboratory measurements. In contrast to traditional binary classifiers, we leverage neural networks and decision tree models to represent patient data as 'SynthA1c' latent variables, which mimic blood hemoglobin A1c empirical lab measurements, that achieve sensitivities as high as 87.6%. To evaluate how SynthA1c models may generalize to other patient populations, we introduce a novel generalizable metric that uses vanilla data augmentation techniques to predict model performance on input out-of-domain covariates. We show that image-derived phenotypes and physical examination data together can accurately predict diabetes risk as a means of opportunistic risk stratification enabled by artificial intelligence and medical imaging. Our code is available at https://github.com/allisonjchae/DMT2RiskAssessment.
Submitted 27 July, 2023; v1 submitted 20 September, 2022;
originally announced September 2022.
-
Modeling electronic health record data using a knowledge-graph-embedded topic model
Authors:
Yuesong Zou,
Ahmad Pesaranghader,
Aman Verma,
David Buckeridge,
Yue Li
Abstract:
The rapid growth of electronic health record (EHR) datasets opens up promising opportunities to understand human diseases in a systematic way. However, effective extraction of clinical knowledge from the EHR data has been hindered by its sparsity and noise. We present KG-ETM, an end-to-end knowledge graph-based multimodal embedded topic model. KG-ETM distills latent disease topics from EHR data by learning the embedding from the medical knowledge graphs. We applied KG-ETM to a large-scale EHR dataset consisting of over 1 million patients. We evaluated its performance based on EHR reconstruction and drug imputation. KG-ETM demonstrated superior performance over the alternative methods on both tasks. Moreover, our model learned clinically meaningful graph-informed embedding of the EHR codes. In addition, our model is able to discover interpretable and accurate patient representations for patient stratification and drug recommendations.
Submitted 3 June, 2022;
originally announced June 2022.
-
Inferring the shape of data: A probabilistic framework for analyzing experiments in the natural sciences
Authors:
Korak Kumar Ray,
Anjali R. Verma,
Ruben L. Gonzalez Jr,
Colin D. Kinz-Thompson
Abstract:
A critical step in data analysis for many different types of experiments is the identification of features with theoretically defined shapes in N-dimensional datasets; examples of this process include finding peaks in multi-dimensional molecular spectra or emitters in fluorescence microscopy images. Identifying such features involves determining if the overall shape of the data is consistent with an expected shape; however, it is generally unclear how to quantitatively make this determination. In practice, many analysis methods employ subjective, heuristic approaches, which complicates the validation of any ensuing results, especially as the amount and dimensionality of the data increase. Here, we present a probabilistic solution to this problem by using Bayes' rule to calculate the probability that the data has any one of several potential shapes. This probabilistic approach may be used to objectively compare how well different theories describe a dataset, identify changes between datasets, and detect features within data using a corollary method called Bayesian Inference-based Template Search (BITS); several proof-of-principle examples are provided. Altogether, this mathematical framework serves as an automated 'engine' capable of computationally executing analysis decisions currently made by visual inspection across the sciences.
Submitted 24 August, 2022; v1 submitted 25 September, 2021;
originally announced September 2021.
-
Predicting 3D RNA Folding Patterns via Quadratic Binary Optimization
Authors:
Mark W. Lewis,
Amit Verma,
Rick Hennig
Abstract:
The structure of an RNA molecule plays a significant role in its biological function. Predicting structure given a one-dimensional sequence of RNA nucleotide bases is a difficult and important problem. Many computer programs (known as in silico methods) are available for predicting 2-dimensional (secondary) structures; however, 3-dimensional (tertiary) structure prediction is much more difficult, mainly due to the far greater number of feasible solutions and fewer experimental data on the thermodynamic energies of 3D structures. It is also challenging to verify the most likely three-dimensional structure even with the availability of sophisticated x-ray crystallography and nuclear magnetic resonance imaging technologies. In this paper we develop three-dimensional RNA folding predictions by adding penalty and reward parameters to a previous two-dimensional approach based on Quadratic Unconstrained Binary Optimization (QUBO) models. These parameters provide flexibility in the amount of three-dimensional folding allowed. We address the problem of multiple near-optimal structures via a new weighted similarity structure measure and illustrate folding pathways via progressively improving local optimal solutions. The problems are solved via a new commercial QUBO solver AlphaQUBO (Meta-Analytics, 2020) that solves problems having hundreds of thousands of binary variables.
Submitted 14 June, 2021;
originally announced June 2021.
-
Supervised multi-specialist topic model with applications on large-scale electronic health record data
Authors:
Ziyang Song,
Xavier Sumba Toral,
Yixin Xu,
Aihua Liu,
Liming Guo,
Guido Powell,
Aman Verma,
David Buckeridge,
Ariane Marelli,
Yue Li
Abstract:
Motivation: Electronic health record (EHR) data provides a new venue to elucidate disease comorbidities and latent phenotypes for precision medicine. To fully exploit its potential, a realistic data generative process of the EHR data needs to be modelled. We present MixEHR-S to jointly infer specialist-disease topics from the EHR data. As the key contribution, we model the specialist assignments and ICD-coded diagnoses as the latent topics based on the patient's underlying disease topic mixture in a novel unified supervised hierarchical Bayesian topic model. For efficient inference, we developed a closed-form collapsed variational inference algorithm to learn the model distributions of MixEHR-S. We applied MixEHR-S to two independent large-scale EHR databases in Quebec with three targeted applications: (1) Congenital Heart Disease (CHD) diagnostic prediction among 154,775 patients; (2) Chronic obstructive pulmonary disease (COPD) diagnostic prediction among 73,791 patients; (3) future insulin treatment prediction among 78,712 patients diagnosed with diabetes as a means to assess disease exacerbation. In all three applications, MixEHR-S conferred clinically meaningful latent topics among the most predictive latent topics and achieved superior target prediction accuracy compared to the existing methods, providing opportunities for prioritizing high-risk patients for healthcare services. MixEHR-S source code and scripts of the experiments are freely available at https://github.com/li-lab-mcgill/mixehrS
Submitted 3 May, 2021;
originally announced May 2021.
-
Network reinforcement driven drug repurposing for COVID-19 by exploiting disease-gene-drug associations
Authors:
Yonghyun Nam,
Jae-Seung Yun,
Seung Mi Lee,
Ji Won Park,
Ziqi Chen,
Brian Lee,
Anurag Verma,
Xia Ning,
Li Shen,
Dokyoon Kim
Abstract:
Currently, the number of patients with COVID-19 has significantly increased. Thus, there is an urgent need for developing treatments for COVID-19. Drug repurposing, which is the process of reusing already-approved drugs for new medical conditions, can be a good way to solve this problem quickly and broadly. Many clinical trials for COVID-19 patients using treatments for other diseases have already been in place or will be performed at clinical sites in the near future. Additionally, patients with comorbidities such as diabetes mellitus, obesity, liver cirrhosis, kidney diseases, hypertension, and asthma are at higher risk for severe illness from COVID-19. Thus, the relationship of comorbid diseases with COVID-19 may help to find repurposable drugs. To reduce trial and error in finding treatments for COVID-19, we propose building a network-based drug repurposing framework to prioritize repurposable drugs. First, we utilized knowledge of COVID-19 to construct a disease-gene-drug network (DGDr-Net) representing a COVID-19-centric interactome with components for diseases, genes, and drugs. DGDr-Net consisted of 592 diseases, 26,681 human genes and 2,173 drugs, and medical information for 18 common comorbidities. The DGDr-Net recommended candidate repurposable drugs for COVID-19 through network reinforcement driven scoring algorithms. The scoring algorithms determined the priority of recommendations by utilizing graph-based semi-supervised learning. From the predicted scores, we recommended 30 drugs, including dexamethasone, resveratrol, methotrexate, indomethacin, quercetin, etc., as repurposable drugs for COVID-19, and the results were verified with drugs that have been under clinical trials. The list of drugs via a data-driven computational approach could help reduce trial-and-error in finding treatment for COVID-19.
Submitted 12 August, 2020;
originally announced August 2020.
-
Modeling disease progression in longitudinal EHR data using continuous-time hidden Markov models
Authors:
Aman Verma,
Guido Powell,
Yu Luo,
David Stephens,
David L. Buckeridge
Abstract:
Modeling disease progression in healthcare administrative databases is complicated by the fact that patients are observed only at irregular intervals when they seek healthcare services. In a longitudinal cohort of 76,888 patients with chronic obstructive pulmonary disease (COPD), we used a continuous-time hidden Markov model with a generalized linear model to model healthcare utilization events. We found that the fitted model provides interpretable results suitable for summarization and hypothesis generation.
Submitted 2 December, 2018;
originally announced December 2018.