
Implications of mappings between ICD clinical diagnosis codes and Human Phenotype Ontology terms

Authors: Amelia LM Tan, Rafael S Gonçalves, William Yuan, Gabriel A Brat, The Consortium for Clinical Characterization of COVID-19 by EHR, Robert Gentleman, Isaac S Kohane

Abstract: Objective: Integrating EHR data with other resources is essential in rare disease research due to low disease prevalence. Such integration is dependent on the alignment of ontologies used for data annotation. The International Classification of Diseases (ICD) is used to annotate clinical diagnoses; the Human Phenotype Ontology (HPO) to annotate phenotypes. Although these ontologies overlap in biomedical entities described, the extent to which they are interoperable is unknown. We investigate how well aligned these ontologies are and whether such alignments facilitate EHR data integration. Materials and Methods: We conducted an empirical analysis of the coverage of mappings between ICD and HPO. We interpret this mapping coverage as a proxy for how easily clinical data can be integrated with research ontologies such as HPO. We quantify how exhaustively ICD codes are mapped to HPO by analyzing mappings in the UMLS Metathesaurus. We analyze the proportion of ICD codes mapped to HPO within a real-world EHR dataset. Results and Discussion: Our analysis revealed that only 2.2% of ICD codes have direct mappings to HPO in UMLS. Within our EHR dataset, less than 50% of ICD codes have mappings to HPO terms. ICD codes that are used frequently in EHR data tend to have mappings to HPO; ICD codes that represent rarer medical conditions are seldom mapped. Conclusion: We find that interoperability between ICD and HPO via UMLS is limited. While other mapping sources could be incorporated, there are no established conventions for what resources should be used to complement UMLS.

Submitted 11 July, 2024; originally announced July 2024.

arXiv:2306.11547 [pdf, other]

Event Stream GPT: A Data Pre-processing and Modeling Library for Generative, Pre-trained Transformers over Continuous-time Sequences of Complex Events

Authors: Matthew B. A. McDermott, Bret Nestor, Peniel Argaw, Isaac Kohane

Abstract: Generative, pre-trained transformers (GPTs, a.k.a. "Foundation Models") have reshaped natural language processing (NLP) through their versatility in diverse downstream tasks. However, their potential extends far beyond NLP. This paper provides a software utility to help realize this potential, extending the applicability of GPTs to continuous-time sequences of complex events with internal dependencies, such as medical record datasets. Despite their potential, the adoption of foundation models in these domains has been hampered by the lack of suitable tools for model construction and evaluation. To bridge this gap, we introduce Event Stream GPT (ESGPT), an open-source library designed to streamline the end-to-end process for building GPTs for continuous-time event sequences. ESGPT allows users to (1) build flexible, foundation-model scale input datasets by specifying only a minimal configuration file, (2) leverage a Hugging Face compatible modeling API for GPTs over this modality that incorporates intra-event causal dependency structures and autoregressive generation capabilities, and (3) evaluate models via standardized processes that can assess few and even zero-shot performance of pre-trained models on user-specified fine-tuning tasks.

Submitted 21 June, 2023; v1 submitted 20 June, 2023; originally announced June 2023.

arXiv:2212.10320 [pdf]

Construction of extra-large scale screening tools for risks of severe mental illnesses using real world healthcare data

Authors: Dianbo Liu, Karmel W. Choi, Paulo Lizano, William Yuan, Kun-Hsing Yu, Jordan W. Smoller, Isaac Kohane

Abstract: Importance: The prevalence of severe mental illnesses (SMIs) in the United States is approximately 3% of the whole population. The ability to conduct risk screening of SMIs at large scale could inform early prevention and treatment. Objective: A scalable machine learning based tool was developed to conduct population-level risk screening for SMIs, including schizophrenia, schizoaffective disorders, psychosis, and bipolar disorders, using 1) healthcare insurance claims and 2) electronic health records (EHRs). Design, setting and participants: Data from beneficiaries from a nationwide commercial healthcare insurer with 77.4 million members and data from patients from EHRs from eight academic hospitals based in the U.S. were used. First, the predictive models were constructed and tested using data in case-control cohorts from insurance claims or EHR data. Second, performance of the predictive models across data sources was analyzed. Third, as an illustrative application, the models were further trained to predict risks of SMIs among 18-year-old young adults and individuals with substance associated conditions. Main outcomes and measures: Machine learning-based predictive models for SMIs in the general population were built based on insurance claims and EHR.

Submitted 12 January, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

arXiv:2212.01437 [pdf, other]

Identifying Heterogeneous Treatment Effects in Multiple Outcomes using Joint Confidence Intervals

Authors: Peniel N. Argaw, Elizabeth Healey, Isaac S. Kohane

Abstract: Heterogeneous treatment effects (HTEs) are commonly identified during randomized controlled trials (RCTs). Identifying subgroups of patients with similar treatment effects is of high interest in clinical research to advance precision medicine. Often, multiple clinical outcomes are measured during an RCT, each having a potentially heterogeneous effect. Recently there has been high interest in identifying subgroups from HTEs; however, there has been less focus on developing tools in settings where there are multiple outcomes. In this work, we propose a framework for partitioning the covariate space to identify subgroups across multiple outcomes based on the joint CIs. We test our algorithm on synthetic and semi-synthetic data where there are two outcomes, and demonstrate that our algorithm is able to capture the HTE in both outcomes simultaneously.

Submitted 2 December, 2022; originally announced December 2022.

Comments: Accepted to ML4H 2022. Available at https://proceedings.mlr.press/v193/argaw22a.html

Journal ref: Proceedings of Machine Learning Research 193 (2022) 141-170

arXiv:1911.10241 [pdf, other]

Cross-modal representation alignment of molecular structure and perturbation-induced transcriptional profiles

Authors: Samuel G. Finlayson, Matthew B. A. McDermott, Alex V. Pickering, Scott L. Lipnick, Isaac S. Kohane

Abstract: Modeling the relationship between chemical structure and molecular activity is a key goal in drug development. Many benchmark tasks have been proposed for molecular property prediction, but these tasks are generally aimed at specific, isolated biomedical properties. In this work, we propose a new cross-modal small molecule retrieval task, designed to force a model to learn to associate the structure of a small molecule with the transcriptional change it induces. We develop this task formally as a multi-view alignment problem, and present a coordinated deep learning approach that jointly optimizes representations of both chemical structure and perturbational gene expression profiles. We benchmark our results against oracle models and principled baselines, and find that cell line variability markedly influences performance in this domain. Our work establishes the feasibility of this new task, elucidates the limitations of current data and systems, and may serve to catalyze future research in small molecule representation learning.

Submitted 1 October, 2020; v1 submitted 22 November, 2019; originally announced November 2019.

Comments: Accepted for oral presentation at the Pacific Symposium of Biocomputing, 2021

arXiv:1812.01547 [pdf, other]

Towards generative adversarial networks as a new paradigm for radiology education

Authors: Samuel G. Finlayson, Hyunkwang Lee, Isaac S. Kohane, Luke Oakden-Rayner

Abstract: Medical students and radiology trainees typically view thousands of images in order to "train their eye" to detect the subtle visual patterns necessary for diagnosis. Nevertheless, infrastructural and legal constraints often make it difficult to access and quickly query an abundance of images with a user-specified feature set. In this paper, we use a conditional generative adversarial network (GAN) to synthesize $1024\times1024$ pixel pelvic radiographs that can be queried with conditioning on fracture status. We demonstrate that the conditional GAN learns features that distinguish fractures from non-fractures by training a convolutional neural network exclusively on images sampled from the GAN and achieving an AUC of $>0.95$ on a held-out set of real images. We conduct additional analysis of the images sampled from the GAN and describe ongoing work to validate educational efficacy.

Submitted 4 December, 2018; originally announced December 2018.

Comments: Machine Learning for Health (ML4H) Workshop at NeurIPS 2018 arXiv:cs/0101200

Report number: ML4H/2018/224

arXiv:1811.01294 [pdf, other]

Learning Contextual Hierarchical Structure of Medical Concepts with Poincairé Embeddings to Clarify Phenotypes

Authors: Brett K. Beaulieu-Jones, Isaac S. Kohane, Andrew L. Beam

Abstract: Biomedical association studies are increasingly done using clinical concepts, and in particular diagnostic codes from clinical data repositories, as phenotypes. Clinical concepts can be represented in a meaningful vector space using word embedding models. These embeddings allow for comparison between clinical concepts or for straightforward input to machine learning models. Using traditional approaches, good representations require high dimensionality, making downstream tasks such as visualization more difficult. We applied Poincaré embeddings in a 2-dimensional hyperbolic space to a large-scale administrative claims database and show performance comparable to 100-dimensional embeddings in a Euclidean space. We then examine disease relationships under different disease contexts to better understand potential phenotypes.

Submitted 3 November, 2018; originally announced November 2018.

Comments: To appear in 2019 Pacific Symposium on Biocomputing

arXiv:1804.05296 [pdf, other]

Adversarial Attacks Against Medical Deep Learning Systems

Authors: Samuel G. Finlayson, Hyung Won Chung, Isaac S. Kohane, Andrew L. Beam

Abstract: The discovery of adversarial examples has raised concerns about the practical deployment of deep learning systems. In this paper, we demonstrate that adversarial examples are capable of manipulating deep learning systems across three clinical domains. For each of our representative medical deep learning classifiers, both white and black box attacks were highly successful. Our models are representative of the current state of the art in medical computer vision and, in some cases, directly reflect architectures already seeing deployment in real world clinical settings. In addition to the technical contribution of our paper, we synthesize a large body of knowledge about the healthcare system to argue that medicine may be uniquely susceptible to adversarial attacks, both in terms of monetary incentives and technical vulnerability. To this end, we outline the healthcare economy and the incentives it creates for fraud and provide concrete examples of how and why such attacks could be realistically carried out. We urge practitioners to be aware of current vulnerabilities when deploying deep learning systems in clinical settings, and encourage the machine learning community to further investigate the domain-specific characteristics of medical learning systems.

Submitted 4 February, 2019; v1 submitted 14 April, 2018; originally announced April 2018.

arXiv:1804.02097 [pdf, other]

Multi-view Banded Spectral Clustering with Application to ICD9 Clustering

Authors: Luwan Zhang, Katherine Liao, Issac Kohane, Tianxi Cai

Abstract: Despite recent development in methodology, community detection remains a challenging problem. Existing literature largely focuses on the standard setting where a network is learned using an observed adjacency matrix from a single data source. Constructing a shared network from multiple data sources is more challenging due to the heterogeneity across populations. Additionally, no existing method leverages the prior distance knowledge available in many domains to help the discovery of the network structure. To bridge this gap, in this paper we propose a novel spectral clustering method that optimally combines multiple data sources while leveraging the prior distance knowledge. The proposed method combines a banding step guided by the distance knowledge with a subsequent weighting step to maximize consensus across multiple sources. Its statistical performance is thoroughly studied under a multi-view stochastic block model. We also provide a simple yet optimal rule of choosing weights in practice. The efficacy and robustness of the method are fully demonstrated through extensive simulations. Finally, we apply the method to cluster the International Classification of Diseases, Ninth Revision (ICD9) codes and yield a very insightful clustering structure by integrating information from a large claim database and two healthcare systems.

Submitted 20 June, 2018; v1 submitted 5 April, 2018; originally announced April 2018.

arXiv:1804.01486 [pdf, other]

Clinical Concept Embeddings Learned from Massive Sources of Multimodal Medical Data

Authors: Andrew L. Beam, Benjamin Kompa, Allen Schmaltz, Inbar Fried, Griffin Weber, Nathan P. Palmer, Xu Shi, Tianxi Cai, Isaac S. Kohane

Abstract: Word embeddings are a popular approach to unsupervised learning of word relationships that are widely used in natural language processing. In this article, we present a new set of embeddings for medical concepts learned using an extremely large collection of multimodal medical data. Leaning on recent theoretical insights, we demonstrate how an insurance claims database of 60 million members, a collection of 20 million clinical notes, and 1.7 million full text biomedical journal articles can be combined to embed concepts into a common space, resulting in the largest ever set of embeddings for 108,477 medical concepts. To evaluate our approach, we present a new benchmark methodology based on statistical power specifically designed to test embeddings of medical concepts. Our approach, called cui2vec, attains state-of-the-art performance relative to previous methods in most instances. Finally, we provide a downloadable set of pre-trained embeddings for other researchers to use, as well as an online tool for interactive exploration of the cui2vec embeddings.

Submitted 19 August, 2019; v1 submitted 4 April, 2018; originally announced April 2018.

arXiv:1804.00735 [pdf, other]

A Fast Divide-and-Conquer Sparse Cox Regression

Authors: Yan Wang, Nathan Palmer, Qian Di, Joel Schwartz, Isaac Kohane, Tianxi Cai

Abstract: We propose a computationally and statistically efficient divide-and-conquer (DAC) algorithm to fit sparse Cox regression to massive datasets where the sample size $n_0$ is exceedingly large and the covariate dimension $p$ is not small but $n_0\gg p$. The proposed algorithm achieves computational efficiency through a one-step linear approximation followed by a least squares approximation to the partial likelihood (PL). These sequences of linearization enable us to maximize the PL with only a small subset and perform penalized estimation via a fast approximation to the PL. The algorithm is applicable for the analysis of both time-independent and time-dependent survival data. Simulations suggest that the proposed DAC algorithm substantially outperforms the full sample-based estimators and the existing DAC algorithm with respect to the computational speed, while it achieves similar statistical efficiency as the full sample-based estimators. The proposed algorithm was applied to an extraordinarily large time-independent survival dataset and an extraordinarily large time-dependent survival dataset for the prediction of heart failure-specific readmission within 30 days among Medicare heart failure patients.

Submitted 2 April, 2018; originally announced April 2018.

arXiv:1710.03613 [pdf]

Auditory Brainstem Response in Infants and Children with Autism: A Meta-Analysis

Authors: Oren Miron, Andrew L. Beam, Isaac S. Kohane

Abstract: Infants with autism were recently found to have prolonged Auditory Brainstem Response (ABR); however, at older ages, findings are contradictory. We compared ABR differences between participants with autism and controls with respect to age using a meta-analysis. Data sources included MEDLINE, EMBASE, Web of Science, Google Scholar, HOLLIS and ScienceDirect from their inception to June 2016. The 25 studies that were included had a total of 1349 participants (727 participants with autism and 622 controls) and an age range of 0-40 years. Prolongation of wave V in autism had a significant negative correlation with age (R2=0.23; P=.01). The 22 studies below age 18 years showed a significantly prolonged wave V in autism (Standard Mean Difference=0.6 [95% CI, 0.5 to 0.8]; P<.001). The 3 studies above 18 years of age showed a significantly shorter wave V in autism (SMD=-0.6 [95% CI, -1.0 to -0.2]; P=.004). Prolonged ABR was consistent in infants and children with autism, suggesting it can serve as an autism biomarker at infancy. As the ABR is routinely used to screen infants for hearing impairment, the opportunity for replication studies is extensive.

Submitted 10 October, 2017; originally announced October 2017.

Showing 1–12 of 12 results for author: Kohane, I